$f$-Divergence Inequalities
Abstract

This paper develops systematic approaches to obtain $f$-divergence inequalities, dealing with pairs of probability
measures defined on arbitrary alphabets. Functional domination is one such approach, where special emphasis is placed
on finding the best possible constant upper bounding a ratio of $f$-divergences. Another approach used for the derivation
of bounds among $f$-divergences relies on moment inequalities and the logarithmic-convexity property, which results in
tight bounds on the relative entropy and Bhattacharyya distance in terms of $\chi^2$ divergences. A rich variety of bounds are
shown to hold under boundedness assumptions on the relative information. Special attention is devoted to the total variation
distance and its relation to the relative information and relative entropy, including “reverse Pinsker inequalities,” as well
as on the $E_\gamma$ divergence, which generalizes the total variation distance. Pinsker’s inequality is extended for this type of
$f$-divergence, a result which leads to an inequality linking the relative entropy and relative information spectrum. Integral
expressions of the Rényi divergence in terms of the relative information spectrum are derived, leading to bounds on the
Rényi divergence in terms of either the variational distance or relative entropy.
I. Introduction

Throughout their development, information theory, and more generally, probability theory, have benefitted from
non-negative measures of dissimilarity, or loosely speaking, distances, between pairs of probability measures
defined on the same measurable space (see, e.g., [41], [65], [105]). Notable among those measures are (see
Section II for definitions):

• total variation distance $|P - Q|$;

• relative entropy $D(P\|Q)$;

• $\chi^2$-divergence $\chi^2(P\|Q)$;

• Hellinger divergence $\mathscr{H}_\alpha(P\|Q)$;

• Rényi divergence $D_\alpha(P\|Q)$.

It is useful, particularly in proving convergence results, to give bounds of one measure of dissimilarity in terms
of another. The most celebrated among those bounds is Pinsker’s inequality:¹

$$ \tfrac12 |P - Q|^2 \log e \le D(P\|Q) \tag{1} $$

proved by Csiszár² [23] and Kullback [60], with Kemperman [56] independently a bit later. Improved and
generalized versions of Pinsker’s inequality have been studied, among others, in [39], [44], [46], [76], [86],

ei, m. 
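As a quick numerical illustration of (1), the following minimal sketch samples random pairs of distributions on a finite alphabet and records the smallest observed gap between the two sides. It assumes natural logarithms (so that $\log e = 1$) and strictly positive probability mass functions; the helper names `total_variation`, `relative_entropy`, `random_pmf` and `pinsker_gap` are illustrative and not taken from the paper.

```python
import math
import random

def total_variation(p, q):
    """|P - Q| = sum_a |P(a) - Q(a)|, which takes values in [0, 2]."""
    return sum(abs(pi - qi) for pi, qi in zip(p, q))

def relative_entropy(p, q):
    """D(P||Q) in nats; assumes q[a] > 0 wherever p[a] > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def random_pmf(n, rng):
    """Draw a strictly positive probability mass function on n letters."""
    weights = [rng.random() + 1e-12 for _ in range(n)]
    total = sum(weights)
    return [w / total for w in weights]

def pinsker_gap(p, q):
    """D(P||Q) - (1/2)|P - Q|^2, which (1) asserts is non-negative (in nats)."""
    return relative_entropy(p, q) - 0.5 * total_variation(p, q) ** 2

rng = random.Random(0)  # fixed seed so the experiment is repeatable
worst_gap = min(
    pinsker_gap(random_pmf(5, rng), random_pmf(5, rng)) for _ in range(1000)
)
```

A non-negative `worst_gap` is consistent with (1); it is of course not a proof, only a sanity check on sampled pairs.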

Relationships among measures of distances between probability measures have long been a focus of interest
in probability theory and statistics (e.g., for studying the rate of convergence of measures). The reader is referred
to surveys in [41, Section 3], [65, Chapter 2], [86] and [87, Appendix 3], which provide several relationships
among useful $f$-divergences and other measures of dissimilarity between probability measures. Some notable
existing bounds among $f$-divergences include, in addition to (1):

• [63, Lemma 1], [64, p. 25]

$$ \mathscr{H}_{1/2}^2(P\|Q) \le |P - Q|^2 \tag{2} $$

$$ \le \mathscr{H}_{1/2}(P\|Q) \bigl(4 - \mathscr{H}_{1/2}(P\|Q)\bigr); \tag{3} $$

. m ( 2 . 2 )] 

$$ \tfrac14 |P - Q|^2 \le 1 - \exp\bigl(-D(P\|Q)\bigr); \tag{4} $$

1 The folklore in information theory is that 0 is due to Pinsker EDO, albeit with a suboptimal constant. As explained in nm although 
no such inequality appears in mi. it is possible to put together two of Pinsker’s bounds to conclude that ^ |P — Q\ 2 log e < D(P\\Q). 
² Csiszár derived (1) in [23, Theorem 4.1] after publishing a weaker version in [22, Corollary 1] a year earlier.
• m Theorem 5], [35, Theorem 4], [99]

$$ D(P\|Q) \le \log\bigl(1 + \chi^2(P\|Q)\bigr); \tag{5} $$

• 08] Corollary 5.6] For all a > 2 

$$ \chi^2(P\|Q) \le \bigl(1 + (\alpha - 1)\, \mathscr{H}_\alpha(P\|Q)\bigr)^{\frac{1}{\alpha - 1}} - 1; \tag{6} $$

the inequality in (6) is reversed if $\alpha \in (0,1) \cup (1,2]$, and it holds with equality if $\alpha = 2$.
• El, E3, SMI (58)] 



$$ \chi^2(P\|Q) \ge \begin{cases} |P - Q|^2, & |P - Q| \in [0,1] \\ \dfrac{|P - Q|}{2 - |P - Q|}, & |P - Q| \in (1,2); \end{cases} \tag{7} $$

. 06J 

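$$ \frac{D^2(P\|Q)}{D(Q\|P)} \le \tfrac12\, \chi^2(P\|Q) \log e; \tag{8} $$

$$ 4\, \mathscr{H}_{1/2}^2(P\|Q) \log^2 e \le D(P\|Q)\, D(Q\|P) \tag{9} $$

$$ \le \tfrac14\, \chi^2(P\|Q)\, \chi^2(Q\|P) \log^2 e; \tag{10} $$

$$ 4\, \mathscr{H}_{1/2}(P\|Q) \log e \le D(P\|Q) + D(Q\|P) \tag{11} $$

$$ \le \tfrac12 \bigl(\chi^2(P\|Q) + \chi^2(Q\|P)\bigr) \log e; \tag{12} $$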

• [32, (2.8)]

$$ D(P\|Q) \le \tfrac12 \bigl(|P - Q| + \chi^2(P\|Q)\bigr) \log e; \tag{13} $$

• 061, 113 Corollary 32], EQ 



$$ D(P\|Q) + D(Q\|P) \ge |P - Q| \log\left(\frac{2 + |P - Q|}{2 - |P - Q|}\right), \tag{14} $$

$$ \chi^2(P\|Q) + \chi^2(Q\|P) \ge \frac{8\, |P - Q|^2}{4 - |P - Q|^2}; \tag{15} $$

• [53, p. 711] (cf. a generalized form in [87, Lemma A.3.5])

$$ \mathscr{H}_{1/2}(P\|Q) \log e \le D(P\|Q), \tag{16} $$

generalized in [65, Proposition 2.15]:

$$ \mathscr{H}_\alpha(P\|Q) \log e \le D_\alpha(P\|Q) \le D(P\|Q) \tag{17} $$

for $\alpha \in (0,1)$, and

$$ \mathscr{H}_\alpha(P\|Q) \log e \ge D_\alpha(P\|Q) \ge D(P\|Q) \tag{18} $$

for $\alpha \in (1,\infty)$.

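• [65, Proposition 2.35] If $\alpha \in (0,1)$, $\beta = \max\{\alpha, 1 - \alpha\}$, then

$$ 1 - \bigl(1 + \tfrac12 |P - Q|\bigr)^{\beta} \bigl(1 - \tfrac12 |P - Q|\bigr)^{1 - \beta} $$

$$ \le (1 - \alpha)\, \mathscr{H}_\alpha(P\|Q) \tag{19} $$

$$ \le \tfrac12 |P - Q|. \tag{20} $$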

• [37, Theorems 3 and 16]

- $D_\alpha(P\|Q)$ is monotonically increasing in $\alpha > 0$;

- $\bigl(\tfrac{1}{\alpha} - 1\bigr) D_\alpha(P\|Q)$ is monotonically decreasing in $\alpha \in (0,1]$;

- [65, Proposition 2.7] the same monotonicity properties hold for $\mathscr{H}_\alpha(P\|Q)$.

• [46] If $\alpha \in (0,1]$, then

$$ \tfrac{\alpha}{2} |P - Q|^2 \log e \le D_\alpha(P\|Q); \tag{21} $$

• An inequality for AP a (P\\Q) 16] Lemma 1] becomes (at a = 1), the parallelogram identity [26] (2.2)] 

$$ D(P_0\|Q) + D(P_1\|Q) = D(P_0\|P_{1/2}) + D(P_1\|P_{1/2}) + 2 D(P_{1/2}\|Q) \tag{22} $$

with $P_{1/2} = \tfrac12 (P_0 + P_1)$, and extends a result for the relative entropy in [26, Theorem 2.1].

• A “reverse Pinsker inequality”, providing an upper bound on the relative entropy in terms of the total
variation distance, does not exist in general since we can find distributions which are arbitrarily close in
total variation but with arbitrarily high relative entropy. Nevertheless, it is possible to introduce constraints
under which such reverse Pinsker inequalities hold. In the special case of a finite alphabet $\mathcal{A}$, Csiszár and
Talata [29, p. 1012] showed that

$$ D(P\|Q) \le \left(\frac{\log e}{Q_{\min}}\right) |P - Q|^2 \tag{23} $$

when $Q_{\min} = \min_{a \in \mathcal{A}} Q(a)$ is positive.³

'Recent applications of ( |23[ can be found in (HI Appendix D] and II100 1 Lemma 7] for the analysis of the third-order asymptotics of 
the discrete memoryless channel with or without cost constraints. 


December 6. 2016 


DRAFT 



SASON AND VERDU: /-DIVERGENCE INEQUALITIES 


5 


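• [25, Theorem 3.1] If $f\colon (0,\infty) \to \mathbb{R}$ is a strictly convex function, then there exists a real-valued function
$\psi_f$ such that $\lim_{x \downarrow 0} \psi_f(x) = 0$ and⁴

$$ |P - Q| \le \psi_f\bigl(D_f(P\|Q)\bigr), \tag{24} $$

which implies

$$ \lim_{n \to \infty} D_f(P_n\|Q_n) = 0 \;\Rightarrow\; \lim_{n \to \infty} |P_n - Q_n| = 0. \tag{25} $$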


The numerical optimization of an $f$-divergence subject to simultaneous constraints on $f_i$-divergences ($i =
1, \ldots, L$) was recently studied in [48], which showed that for that purpose it is enough to restrict attention to
alphabets of cardinality $L + 2$. Earlier, [50] showed that if $L = 1$, then either the solution is obtained by a
pair $(P, Q)$ on a binary alphabet, or it is a deflated version of such a point. Therefore, from a purely numerical
standpoint, the minimization of $D_f(P\|Q)$ such that $D_g(P\|Q) \ge d$ can be accomplished by a grid search on
$[0,1]^2$. Occasionally, as in the case where $D_f(P\|Q) = D(P\|Q)$ and $D_g(P\|Q) = |P - Q|$, it is actually
possible to determine analytically the locus of $(D_f(P\|Q), D_g(P\|Q))$ (see [39]). In fact, as shown in [104,
(22)], a binary alphabet suffices if the single constraint is on the total variation distance. The same conclusion
holds when minimizing the Rényi divergence [92].
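The grid search on $[0,1]^2$ mentioned above can be sketched as follows for the case $D_f = D$ and $D_g = |P - Q|$ on a binary alphabet. This is only an illustration under stated assumptions: natural logarithms, a uniform grid whose resolution `grid` is an arbitrary choice, and hypothetical helper names `binary_kl` and `min_kl_given_tv`.

```python
import math

def binary_kl(a, b):
    """D(P||Q) in nats for P = (a, 1 - a) and Q = (b, 1 - b), with 0 < b < 1."""
    total = 0.0
    for x, y in ((a, b), (1 - a, 1 - b)):
        if x > 0:
            total += x * math.log(x / y)
    return total

def min_kl_given_tv(d, grid=200):
    """Grid search for the minimum of D(P||Q) subject to |P - Q| >= d."""
    best = math.inf
    for i in range(grid + 1):
        a = i / grid
        for j in range(1, grid):  # keep Q strictly inside (0, 1)
            b = j / grid
            if 2 * abs(a - b) >= d:  # |P - Q| = 2|a - b| on a binary alphabet
                best = min(best, binary_kl(a, b))
    return best

# Grid minima for a few values of the total variation constraint.
grid_minima = {d: min_kl_given_tv(d) for d in (0.2, 0.6, 1.0)}
```

Because the grid minimum is taken over a subset of all binary pairs, each value in `grid_minima` is an upper bound on the true minimum and must still respect Pinsker's inequality (1).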

In this work, we find relationships among the various divergence measures outlined above as well as a
number of other measures of dissimilarity between probability measures. The framework of $f$-divergences,
which encompasses the foregoing measures (Rényi divergence is a one-to-one transformation of the Hellinger
divergence) serves as a convenient playground.

The rest of the paper is structured as follows: 

Section II introduces the basic definitions needed and in particular the various measures of dissimilarity
between probability measures used throughout.

Based on functional domination, Section III provides a basic tool for the derivation of bounds among
$f$-divergences. Under mild regularity conditions, this approach further makes it possible to prove the optimality of constants
in those bounds. In addition, we show instances where such optimality can be shown in the absence of regularity
conditions. The basic tool used in Section III is exemplified in obtaining relationships among important
$f$-divergences such as relative entropy, Hellinger divergence and total variation distance. This approach is also
useful in strengthening and providing an alternative proof of Samson’s inequality [89] (a counterpart to Pinsker’s
inequality using Marton’s divergence, useful in proving certain concentration of measure results [13]), whose

4 Eq. {24j follows as a special case of (25] Theoiem 3.1] with tyi — 1 cind uirn — !• 
constant we show cannot be improved. In addition, we show several new results in Section III-E on the maximal
ratios of various $f$-divergences to total variation distance.

Section IV provides an approach for bounding ratios of $f$-divergences, assuming that the relative information
(see Definition 1) is lower and/or upper bounded with probability one. The approach is exemplified in bounding
ratios of relative entropy to various $f$-divergences, and analyzing the local behavior of $f$-divergence ratios when
the reference measure is fixed. We also show that bounded relative information leads to a strengthened version of
Jensen’s inequality, which, in turn, results in upper and lower bounds on the ratio of the non-negative difference
$\log\bigl(1 + \chi^2(P\|Q)\bigr) - D(P\|Q)$ to $D(Q\|P)$. A new reverse version of Samson’s inequality is another byproduct
of the main tool in this section.

The rich structure of the total variation distance as well as its importance in both fundamentals and applications
merits placing special attention on bounding the rest of the distance measures in terms of $|P - Q|$. Section V
gives several useful identities linking the total variation distance with the relative information spectrum, which
result in a number of upper and lower bounds on $|P - Q|$, some of which are tighter than Pinsker’s inequality in
various ranges of the parameters. It also provides refined bounds on $D(P\|Q)$ as a function of $\chi^2$-divergences
and the total variation distance.


Section VI is devoted to proving “reverse Pinsker inequalities,” namely, lower bounds on $|P - Q|$ as a function
of $D(P\|Q)$ involving either (a) bounds on the relative information, (b) Lipschitz constants, or (c) the minimum
mass of the reference measure (in the finite alphabet case). In the latter case, we also examine the relationship
between entropy and the total variation distance from the equiprobable distribution, as well as the exponential
decay of the probability that an independent identically distributed sequence is not strongly typical.

Section VII focuses on the $E_\gamma$ divergence. This $f$-divergence generalizes the total variation distance, and
its utility in information theory has been exemplified in [18], [68], [69], [70], [79]. Based on
the operational interpretation of the DeGroot statistical information [31] and the integral representation of
$f$-divergences as a function of DeGroot’s measure, Section VII provides an integral representation of $f$-divergences
as a function of the $E_\gamma$ divergence; this representation shows that $\{(E_\gamma(P\|Q), E_\gamma(Q\|P)),\ \gamma \ge 1\}$ uniquely
determines $D(P\|Q)$ and $\mathscr{H}_\alpha(P\|Q)$, as well as any other $f$-divergence with twice differentiable $f$. Accordingly,
bounds on the $E_\gamma$ divergence directly translate into bounds on other important $f$-divergences. In addition, we
show an extension of Pinsker’s inequality (1) to $E_\gamma$ divergence, which leads to a relationship between the relative
information spectrum and relative entropy.

The Rényi divergence, which has found a plethora of information-theoretic applications, is the focus of
Section VIII. Expressions of the Rényi divergence are derived in Section VIII-A as a function of the relative
information spectrum. These expressions lead in Section VIII-B to the derivation of bounds on the Rényi
divergence as a function of the variational distance under the boundedness assumption of the relative information.
Bounds on the Rényi divergence of an arbitrary order are derived in Section VIII-C as a function of the relative
entropy when the relative information is bounded.


II. Basic Definitions 
A. Relative Information and Relative Entropy 

We assume throughout that the probability measures $P$ and $Q$ are defined on a common measurable space
$(\mathcal{A}, \mathscr{F})$, and $P \ll Q$ denotes that $P$ is absolutely continuous with respect to $Q$, namely there is no event $\mathcal{F} \in \mathscr{F}$
such that $P(\mathcal{F}) > 0 = Q(\mathcal{F})$.

Definition 1: If $P \ll Q$, the relative information provided by $a \in \mathcal{A}$ according to $(P, Q)$ is given by⁵

$$ \imath_{P\|Q}(a) = \log \frac{\mathrm{d}P}{\mathrm{d}Q}(a). \tag{26} $$

When the argument of the relative information is distributed according to $P$, the resulting real-valued random
variable is of particular interest. Its cumulative distribution function and expected value are known as follows.

Definition 2: If $P \ll Q$, the relative information spectrum is the cumulative distribution function

$$ \mathbb{F}_{P\|Q}(x) = \mathbb{P}\bigl[\imath_{P\|Q}(X) \le x\bigr], \tag{27} $$

with⁶ $X \sim P$. The relative entropy of $P$ with respect to $Q$ is

$$ D(P\|Q) = \mathbb{E}\bigl[\imath_{P\|Q}(X)\bigr] \tag{28} $$

$$ = \mathbb{E}\bigl[\imath_{P\|Q}(Y) \exp\bigl(\imath_{P\|Q}(Y)\bigr)\bigr], \tag{29} $$

where $Y \sim Q$.


B. f-Divergences 

Introduced by Ali-Silvey [2] and Csiszár [21], [23], a useful generalization of the relative entropy, which
retains some of its major properties (and, in particular, the data processing inequality [112]), is the class of
$f$-divergences. A general definition of an $f$-divergence is given in [66, p. 4398], specialized next to the case
where $P \ll Q$.

⁵ $\frac{\mathrm{d}P}{\mathrm{d}Q}$ denotes the Radon-Nikodym derivative (or density) of $P$ with respect to $Q$. Logarithms have an arbitrary common base, and
$\exp(\cdot)$ indicates the inverse function of the logarithm with that base.

⁶ $X \sim P$ means that $\mathbb{P}[X \in \mathcal{F}] = P(\mathcal{F})$ for any event $\mathcal{F} \in \mathscr{F}$.
Definition 3: Let $f\colon (0,\infty) \to \mathbb{R}$ be a convex function, and suppose that $P \ll Q$. The $f$-divergence from $P$
to $Q$ is given by

$$ D_f(P\|Q) = \int f\left(\frac{\mathrm{d}P}{\mathrm{d}Q}\right) \mathrm{d}Q \tag{30} $$

$$ = \mathbb{E}[f(Z)] \tag{31} $$

where

$$ Z = \exp\bigl(\imath_{P\|Q}(Y)\bigr), \quad Y \sim Q, \tag{32} $$

and in (30), we took the continuous extension⁷

$$ f(0) = \lim_{t \downarrow 0} f(t) \in (-\infty, +\infty]. \tag{33} $$

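We can also define $D_f(P\|Q)$ without requiring $P \ll Q$. Let $f\colon (0,\infty) \to \mathbb{R}$ be a convex function with
$f(1) = 0$, and let $f^*\colon (0,\infty) \to \mathbb{R}$ be given by

$$ f^*(t) = t\, f\bigl(\tfrac{1}{t}\bigr) \tag{34} $$

for all $t > 0$. Note that $f^*$ is also convex (see, e.g., [14, Section 3.2.6]), $f^*(1) = 0$, and $D_f(P\|Q) = D_{f^*}(Q\|P)$
if $P \ll\gg Q$. By definition, we take

$$ f^*(0) = \lim_{t \downarrow 0} f^*(t) = \lim_{u \to \infty} \frac{f(u)}{u}. \tag{35} $$

If $p$ and $q$ denote, respectively, the densities of $P$ and $Q$ with respect to a $\sigma$-finite measure $\mu$ (i.e., $p = \frac{\mathrm{d}P}{\mathrm{d}\mu}$,
$q = \frac{\mathrm{d}Q}{\mathrm{d}\mu}$), then we can write (30) as

$$ D_f(P\|Q) = \int q\, f\left(\frac{p}{q}\right) \mathrm{d}\mu \tag{36} $$

$$ = \int_{\{pq > 0\}} q\, f\left(\frac{p}{q}\right) \mathrm{d}\mu + f(0)\, Q(p = 0) + f^*(0)\, P(q = 0). \tag{37} $$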
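On a finite alphabet, with $\mu$ taken to be the counting measure, the decomposition (37) translates directly into a short routine. The sketch below is an assumption-laden illustration rather than anything prescribed by the paper: the caller supplies $f$ together with the boundary values $f(0)$ from (33) and $f^*(0)$ from (35), natural logarithms are used, and the function name `f_divergence` is hypothetical.

```python
import math

def f_divergence(f, p, q, f_at_0, fstar_at_0):
    """Evaluate (37) for probability mass functions p and q on a finite alphabet.

    f_at_0 is f(0) as in (33) and fstar_at_0 is f*(0) as in (35);
    either of them may be math.inf.
    """
    total = 0.0
    for pi, qi in zip(p, q):
        if pi > 0 and qi > 0:
            total += qi * f(pi / qi)   # integral over {pq > 0}
        elif qi > 0:
            total += f_at_0 * qi       # contributes f(0) Q(p = 0)
        elif pi > 0:
            total += fstar_at_0 * pi   # contributes f*(0) P(q = 0)
    return total

p = [0.5, 0.5, 0.0]
q = [0.0, 0.5, 0.5]

# Total variation distance: f(t) = |t - 1|, so f(0) = 1 and f*(0) = 1.
tv_value = f_divergence(lambda t: abs(t - 1), p, q, 1.0, 1.0)

# Relative entropy in nats: f(t) = t log t, so f(0) = 0 and f*(0) = +infinity.
kl_value = f_divergence(lambda t: t * math.log(t), p, q, 0.0, math.inf)

# chi^2-divergence for a pair with P << Q: f(t) = (t - 1)^2, f(0) = 1, f*(0) = +infinity.
chi2_value = f_divergence(lambda t: (t - 1) ** 2, [0.5, 0.5], [0.25, 0.75], 1.0, math.inf)
```

Here `p` puts mass where `q` has none, so the relative entropy is infinite while the total variation distance stays finite, which is the distinction the last two terms of (37) capture.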

Remark 1: Different functions may lead to the same $f$-divergence for all $(P, Q)$: if for an arbitrary $b \in \mathbb{R}$,
we have

$$ f_b(t) = f_0(t) + b\,(t - 1), \quad t \ge 0 \tag{38} $$

then

$$ D_{f_0}(P\|Q) = D_{f_b}(P\|Q). \tag{39} $$

⁷ The convexity of $f\colon (0,\infty) \to \mathbb{R}$ implies its continuity on $(0,\infty)$.
The following key property of $f$-divergences follows from Jensen’s inequality.
Proposition 1: If $f\colon (0,\infty) \to \mathbb{R}$ is convex and $f(1) = 0$, $P \ll Q$, then

$$ D_f(P\|Q) \ge 0. \tag{40} $$

If, furthermore, $f$ is strictly convex at $t = 1$, then equality in (40) holds if and only if $P = Q$.

Surveys on general properties of /-divergences can be found in 1551 , 111051 . f 1061 . 

The assumptions of Proposition 1 are satisfied by many interesting measures of dissimilarity between probability
measures. In particular, the following examples receive particular attention in this paper. As per Definition 3,
in each case the function $f$ is defined on $(0,\infty)$.

1) Relative entropy [59]: $f(t) = t \log t$,

$$ D(P\|Q) = D_f(P\|Q) \tag{41} $$

$$ = D_r(P\|Q) \tag{42} $$

with $r\colon (0,\infty) \to [0,\infty)$ defined as

$$ r(t) = t \log t + (1 - t) \log e. \tag{43} $$

2) Relative entropy ($P \ll\gg Q$): $f(t) = -\log t$,

$$ D(Q\|P) = D_f(P\|Q). \tag{44} $$

3) Jeffrey’s divergence [54] ($P \ll\gg Q$): $f(t) = (t - 1) \log t$,

$$ D(P\|Q) + D(Q\|P) = D_f(P\|Q). \tag{45} $$

4) $\chi^2$-divergence [77]: $f(t) = (t - 1)^2$ or $f(t) = t^2 - 1$,

$$ \chi^2(P\|Q) = D_f(P\|Q) \tag{46} $$

$$ \chi^2(P\|Q) = \int \left(\frac{\mathrm{d}P}{\mathrm{d}Q} - 1\right)^2 \mathrm{d}Q \tag{47} $$

$$ = \int \left(\frac{\mathrm{d}P}{\mathrm{d}Q}\right)^2 \mathrm{d}Q - 1 \tag{48} $$

$$ = \mathbb{E}\bigl[\exp\bigl(2\, \imath_{P\|Q}(Y)\bigr)\bigr] - 1 \tag{49} $$

$$ = \mathbb{E}\bigl[\exp\bigl(\imath_{P\|Q}(X)\bigr)\bigr] - 1 \tag{50} $$

with $X \sim P$ and $Y \sim Q$. Note that if $P \ll\gg Q$, then from the right side of (48), we obtain

$$ \chi^2(Q\|P) = D_g(P\|Q) \tag{51} $$

with $g(t) = \frac{1}{t} - t = f^*(t)$, and $f(t) = t^2 - 1$.

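5) Hellinger divergence of order $\alpha \in (0,1) \cup (1,\infty)$ [54], [65, Definition 2.10]:

$$ \mathscr{H}_\alpha(P\|Q) = D_{f_\alpha}(P\|Q) \tag{52} $$

with

$$ f_\alpha(t) = \frac{t^\alpha - 1}{\alpha - 1}. \tag{53} $$

The $\chi^2$-divergence is the Hellinger divergence of order 2, while $\tfrac12 \mathscr{H}_{1/2}(P\|Q)$ is usually referred to as the
squared Hellinger distance. The analytic extension of $\mathscr{H}_\alpha(P\|Q)$ at $\alpha = 1$ yields

$$ \mathscr{H}_1(P\|Q) \log e = D(P\|Q). \tag{54} $$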

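6) Total variation distance: Setting

$$ f(t) = |t - 1| \tag{55} $$

results in

$$ |P - Q| = D_f(P\|Q) \tag{56} $$

$$ = \int \left| \frac{\mathrm{d}P}{\mathrm{d}Q} - 1 \right| \mathrm{d}Q \tag{57} $$

$$ = 2 \sup_{\mathcal{F} \in \mathscr{F}} \bigl(P(\mathcal{F}) - Q(\mathcal{F})\bigr). \tag{58} $$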

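7) Triangular Discrimination [62], [108] (a.k.a. Vincze-Le Cam distance):

$$ \Delta(P\|Q) = D_f(P\|Q) \tag{59} $$

with

$$ f(t) = \frac{(t - 1)^2}{t + 1}. \tag{60} $$

Note that

$$ \tfrac12 \Delta(P\|Q) = \chi^2\bigl(P \,\|\, \tfrac12 P + \tfrac12 Q\bigr) \tag{61} $$

$$ = \chi^2\bigl(Q \,\|\, \tfrac12 P + \tfrac12 Q\bigr). \tag{62} $$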

8) Jensen-Shannon divergence [67] (a.k.a. capacitory discrimination):

$$ \mathrm{JS}(P\|Q) = D\bigl(P \,\|\, \tfrac12 P + \tfrac12 Q\bigr) + D\bigl(Q \,\|\, \tfrac12 P + \tfrac12 Q\bigr) \tag{63} $$

$$ = D_f(P\|Q) \tag{64} $$

with

$$ f(t) = t \log t - (1 + t) \log \frac{1 + t}{2}. \tag{65} $$

9) $E_\gamma$ divergence (see, e.g., [79, p. 2314]): For $\gamma \ge 1$,

$$ E_\gamma(P\|Q) = D_{f_\gamma}(P\|Q) \tag{66} $$

with

$$ f_\gamma(t) = (t - \gamma)^+ \tag{67} $$

where $(x)^+ = \max\{x, 0\}$. $E_\gamma$ is sometimes called “hockey-stick divergence” because of the shape of $f_\gamma$. If
$\gamma = 1$, then

$$ E_1(P\|Q) = \tfrac12 |P - Q|. \tag{68} $$

10) DeGroot statistical information [31]: For $p \in (0,1)$,

$$ \mathcal{I}_p(P\|Q) = D_{\phi_p}(P\|Q) \tag{69} $$

with

$$ \phi_p(t) = \min\{p, 1 - p\} - \min\{p, 1 - pt\}. \tag{70} $$

Invoking (66)-(70), we get (cf. |M} (77)])

$$ \mathcal{I}_{1/2}(P\|Q) = \tfrac12 E_1(P\|Q) = \tfrac14 |P - Q|. \tag{71} $$

This measure was first proposed by DeGroot [31] due to its operational meaning in Bayesian statistical
hypothesis testing (see Section VII-B), and it was later identified as an $f$-divergence (see [?, Theorem 10]).

11) Marton’s divergence [73, pp. 558-559]:

$$ d_2^2(P, Q) = \min \mathbb{E}\bigl[\mathbb{P}^2[X \ne Y \mid Y]\bigr] \tag{72} $$

$$ = D_s(P\|Q) \tag{73} $$

where the minimum is over all probability measures $P_{XY}$ with respective marginals $P_X = P$ and $P_Y = Q$,
and

$$ s(t) = (t - 1)^2\, 1\{t < 1\}. \tag{74} $$

Note that Marton’s divergence satisfies the triangle inequality [73, Lemma 3.1], and $d_2(P, Q) = 0$ implies
$P = Q$; however, due to its asymmetry, it is not a distance measure.
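Several of the identities in the list above, namely (61), (63)-(65) and (68), can be checked numerically on a small alphabet. The sketch below assumes strictly positive probability mass functions (so that the boundary terms of (37) vanish) and natural logarithms; the helper names are illustrative only.

```python
import math

def d_f(f, p, q):
    """D_f(P||Q) = sum_a q(a) f(p(a)/q(a)) for strictly positive pmfs."""
    return sum(qi * f(pi / qi) for pi, qi in zip(p, q))

def kl(p, q):
    return d_f(lambda t: t * math.log(t), p, q)

def chi2(p, q):
    return d_f(lambda t: (t - 1) ** 2, p, q)

def tv(p, q):
    return d_f(lambda t: abs(t - 1), p, q)

def triangular(p, q):
    return d_f(lambda t: (t - 1) ** 2 / (t + 1), p, q)

def e_gamma(p, q, gamma):
    return d_f(lambda t: max(t - gamma, 0.0), p, q)

def js(p, q):
    return d_f(lambda t: t * math.log(t) - (1 + t) * math.log((1 + t) / 2), p, q)

p = [0.5, 0.3, 0.2]
q = [0.2, 0.3, 0.5]
mix = [(pi + qi) / 2 for pi, qi in zip(p, q)]  # the measure (P + Q)/2

identity_61 = 0.5 * triangular(p, q) - chi2(p, mix)   # (61): should vanish
identity_63 = js(p, q) - (kl(p, mix) + kl(q, mix))    # (63) vs (64)-(65)
identity_68 = e_gamma(p, q, 1.0) - 0.5 * tv(p, q)     # (68): should vanish
```

All three differences should be zero up to floating-point rounding, since each pair of expressions is the same quantity written in two ways.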
C. Rényi Divergence

Another generalization of relative entropy was introduced by Rényi [88] in the special case of finite alphabets.
The general definition (assuming⁸ $P \ll Q$) is the following.

Definition 4: Let $P \ll Q$. The Rényi divergence of order $\alpha \ge 0$ from $P$ to $Q$ is given as follows:

• If $\alpha \in (0,1) \cup (1,\infty)$, then

$$ D_\alpha(P\|Q) = \frac{1}{\alpha - 1} \log\Bigl(\mathbb{E}\bigl[\exp\bigl(\alpha\, \imath_{P\|Q}(Y)\bigr)\bigr]\Bigr) \tag{75} $$

$$ = \frac{1}{\alpha - 1} \log\Bigl(\mathbb{E}\bigl[\exp\bigl((\alpha - 1)\, \imath_{P\|Q}(X)\bigr)\bigr]\Bigr) \tag{76} $$

with $X \sim P$ and $Y \sim Q$.

• If $\alpha = 0$, then⁹

$$ D_0(P\|Q) = \max_{\mathcal{F} \in \mathscr{F}\colon P(\mathcal{F}) = 1} \log \frac{1}{Q(\mathcal{F})}. \tag{77} $$

• If $\alpha = 1$, then

$$ D_1(P\|Q) = D(P\|Q) \tag{78} $$

which is the analytic extension of $D_\alpha(P\|Q)$ at $\alpha = 1$. If $D(P\|Q) < \infty$, it can be verified by L’Hôpital’s
rule that $D(P\|Q) = \lim_{\alpha \uparrow 1} D_\alpha(P\|Q)$.

• If $\alpha = +\infty$ then

$$ D_\infty(P\|Q) = \log\left(\operatorname{ess\,sup} \frac{\mathrm{d}P}{\mathrm{d}Q}(Y)\right) \tag{79} $$

with $Y \sim Q$. If $P \not\ll Q$, we take $D_\infty(P\|Q) = \infty$.

Rényi divergence is a one-to-one transformation of Hellinger divergence of the same order $\alpha \in (0,1) \cup (1,\infty)$:

$$ D_\alpha(P\|Q) = \frac{1}{\alpha - 1} \log\bigl(1 + (\alpha - 1)\, \mathscr{H}_\alpha(P\|Q)\bigr) \tag{80} $$

which, when particularized to order 2 becomes

$$ D_2(P\|Q) = \log\bigl(1 + \chi^2(P\|Q)\bigr). \tag{81} $$

Note that (6), (17), (18) follow from (80) and the monotonicity of the Rényi divergence in its order, which in
turn yields (16).
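The relations (80) and (81), the monotonicity of $D_\alpha$ in $\alpha$, and the orderings (17) and (18) can be checked on a small example. The sketch below works in nats with strictly positive probability mass functions and assumes the Hellinger generator $f_\alpha(t) = (t^\alpha - 1)/(\alpha - 1)$, which is the form consistent with (80); the function names are illustrative.

```python
import math

def renyi(p, q, alpha):
    """D_alpha(P||Q) in nats via (75); alpha > 0, alpha != 1, positive pmfs."""
    s = sum(qi * (pi / qi) ** alpha for pi, qi in zip(p, q))
    return math.log(s) / (alpha - 1)

def hellinger(p, q, alpha):
    """H_alpha(P||Q) with the assumed generator (t^alpha - 1)/(alpha - 1)."""
    return sum(qi * ((pi / qi) ** alpha - 1) / (alpha - 1) for pi, qi in zip(p, q))

def kl(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

def chi2(p, q):
    return sum((pi - qi) ** 2 / qi for pi, qi in zip(p, q))

p = [0.5, 0.3, 0.2]
q = [0.2, 0.3, 0.5]

orders = [0.3, 0.5, 0.7, 1.5, 2.0, 3.0]
renyi_values = [renyi(p, q, a) for a in orders]

# Right side of (80) for each order, to be compared with renyi_values.
via_hellinger = [
    math.log(1 + (a - 1) * hellinger(p, q, a)) / (a - 1) for a in orders
]
```

In nats, (17) reads $\mathscr{H}_\alpha \le D_\alpha \le D$ for $\alpha \in (0,1)$ and (18) reverses both inequalities for $\alpha > 1$; the lists above make it easy to see both regimes side by side.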


⁸ Rényi divergence can also be defined without requiring absolute continuity, e.g., [37, Definition 2].

⁹ The function in (75) is, in general, right-discontinuous at $\alpha = 0$. Rényi [88] defined $D_0(P\|Q) = 0$, while we have followed [37]
defining it instead as $\lim_{\alpha \downarrow 0} D_\alpha(P\|Q)$.
Introduced in [8], the Bhattacharyya distance was popularized in the engineering literature in [55].
Definition 5: The Bhattacharyya distance between $P$ and $Q$, denoted by $B(P\|Q)$, is given by

$$ B(P\|Q) = \tfrac12 D_{1/2}(P\|Q) \tag{82} $$

$$ = \log \frac{1}{\mathbb{E}\bigl[\exp\bigl(\tfrac12\, \imath_{P\|Q}(Y)\bigr)\bigr]}, \quad Y \sim Q. \tag{83} $$

Note that, if $P \ll\gg Q$, then $B(P\|Q) = B(Q\|P)$ and $B(P\|Q) = 0$ if and only if $P = Q$, though $B(P\|Q)$
does not satisfy the triangle inequality.


III. Functional Domination 


Let $f$ and $g$ be convex functions on $(0,\infty)$ with $f(1) = g(1) = 0$, and let $P$ and $Q$ be probability measures
defined on a measurable space $(\mathcal{A}, \mathscr{F})$. If there exists $\alpha > 0$ such that $f(t) \le \alpha\, g(t)$ for all $t \in (0,\infty)$, then it
follows from Definition 3 that

$$ D_f(P\|Q) \le \alpha\, D_g(P\|Q). \tag{84} $$


This simple observation leads to a proof of, for example, ( fT6] ) and the left inequality in Q with the aid of 
Remark Q] 

A. Basic Tool 

Theorem 1: Let $P \ll Q$, and assume

• $f$ is convex on $(0,\infty)$ with $f(1) = 0$;

• $g$ is convex on $(0,\infty)$ with $g(1) = 0$;

• $g(t) > 0$ for all $t \in (0,1) \cup (1,\infty)$.

Denote the function $\kappa\colon (0,1) \cup (1,\infty) \to \mathbb{R}$

$$ \kappa(t) = \frac{f(t)}{g(t)}, \quad t \in (0,1) \cup (1,\infty) \tag{85} $$

and

$$ \bar{\kappa} = \sup_{t \in (0,1) \cup (1,\infty)} \kappa(t). \tag{86} $$

Then,

a)

$$ D_f(P\|Q) \le \bar{\kappa}\, D_g(P\|Q). \tag{87} $$

b) If, in addition, $f'(1) = g'(1) = 0$, then

$$ \sup_{P \ne Q} \frac{D_f(P\|Q)}{D_g(P\|Q)} = \bar{\kappa}. \tag{88} $$

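Proof:

a) The bound in (87) follows from (84) and $f(t) \le \bar{\kappa}\, g(t)$ for all $t > 0$.

b) Since $g$ is positive except at $t = 1$, $D_g(P\|Q) > 0$ if $P \ne Q$. The convexity of $f, g$ on $(0,\infty)$ implies their
continuity; and since $g(t) > 0$ for all $t \in (0,1) \cup (1,\infty)$, $\kappa(\cdot)$ is continuous on both $(0,1)$ and $(1,\infty)$.

To show (88), we fix an arbitrary $\nu \in (0,1) \cup (1,\infty)$ and construct a sequence of pairs of probability
measures whose ratio of $f$-divergence to $g$-divergence converges to $\kappa(\nu)$. To that end, for sufficiently small
$\varepsilon > 0$, let $P_\varepsilon$ and $Q_\varepsilon$ be parametric probability measures defined on the set $\mathcal{A} = \{0, 1\}$ with $P_\varepsilon(0) = \nu \varepsilon$
and $Q_\varepsilon(0) = \varepsilon$. Then,

$$ \lim_{\varepsilon \downarrow 0} \frac{D_f(P_\varepsilon\|Q_\varepsilon)}{D_g(P_\varepsilon\|Q_\varepsilon)} = \lim_{\varepsilon \downarrow 0} \frac{\varepsilon f(\nu) + (1 - \varepsilon)\, f\bigl(\frac{1 - \nu \varepsilon}{1 - \varepsilon}\bigr)}{\varepsilon g(\nu) + (1 - \varepsilon)\, g\bigl(\frac{1 - \nu \varepsilon}{1 - \varepsilon}\bigr)} \tag{89} $$

$$ = \lim_{\alpha \to 0} \frac{f(\nu) + \frac{\nu - 1}{\alpha}\, f(1 - \alpha)}{g(\nu) + \frac{\nu - 1}{\alpha}\, g(1 - \alpha)} \tag{90} $$

$$ = \kappa(\nu) \tag{91} $$

where (90) holds by change of variable $\varepsilon = \alpha / (\nu - 1 + \alpha)$, and (91) holds by the assumption on the
derivatives of $f$ and $g$ at 1, the assumption that $f(1) = g(1) = 0$, and the continuity of $\kappa(\cdot)$ at $\nu$. If $\bar{\kappa} = \kappa(\nu)$
we are done. If the supremum in (86) is not attained on $(0,1) \cup (1,\infty)$, then the right side of (91) can be
made arbitrarily close to $\bar{\kappa}$ by an appropriate choice of $\nu$.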
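The two-point construction in the proof of part b) is easy to reproduce numerically. The sketch below picks, purely as an example, $f = r$ from (43) and $g(t) = (t-1)^2$, both of which have zero derivative at $t = 1$, and evaluates the ratio in (89) for a small $\varepsilon$. Natural logarithms are assumed and the function names are hypothetical.

```python
import math

def r(t):
    """r(t) = t log t + 1 - t in nats, the generator of relative entropy in (43)."""
    return t * math.log(t) + 1 - t

def f2(t):
    """Generator (t - 1)^2 of the chi^2-divergence."""
    return (t - 1) ** 2

def kappa(nu):
    """kappa(nu) = f(nu)/g(nu) for the example choice f = r, g = f2."""
    return r(nu) / f2(nu)

def two_point_ratio(f, g, nu, eps):
    """D_f/D_g for P_eps(0) = nu*eps and Q_eps(0) = eps on the alphabet {0, 1}."""
    t0 = nu                            # likelihood ratio at letter 0
    t1 = (1 - nu * eps) / (1 - eps)    # likelihood ratio at letter 1
    num = eps * f(t0) + (1 - eps) * f(t1)
    den = eps * g(t0) + (1 - eps) * g(t1)
    return num / den

# Relative deviation of the ratio in (89) from kappa(nu) at eps = 1e-5.
deviations = {
    nu: abs(two_point_ratio(r, f2, nu, 1e-5) / kappa(nu) - 1)
    for nu in (0.2, 0.5, 3.0)
}
```

For this example the deviations shrink with $\varepsilon$, in line with (91); evaluating `kappa` near zero also shows values approaching 1, which is the supremum (in nats) used later for the pair relative entropy and $\chi^2$-divergence.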


Remark 2: Beyond the restrictions in Theorem 1a), the only operative restriction imposed by Theorem 1b)
is the differentiability of the functions $f$ and $g$ at $t = 1$. Indeed, we can invoke Remark 1 and add $f'(1)\,(1 - t)$
to $f(t)$, without changing $D_f$ (and likewise with $g$) and thereby satisfying the condition in Theorem 1b); the
stationary point at 1 must be a minimum of both $f$ and $g$ because of the assumed convexity, which implies
their non-negativity on $(0,\infty)$.

Remark 3: It is useful to generalize Theorem 1b) by dropping the assumption on the existence of the
derivatives at 1. To that end, note that the inverse transformation used for the transition to (90) is given by
$\nu = 1 + \alpha \bigl(\tfrac{1}{\varepsilon} - 1\bigr)$ where $\varepsilon > 0$ is sufficiently small, so if $\nu > 1$ (resp. $\nu < 1$), then $\alpha > 0$ (resp. $\alpha < 0$).
Consequently, it is easy to see from (90) that if $\bar{\kappa} = \sup_{t > 1} \kappa(t)$, the construction in the proof can restrict to
$\nu > 1$, in which case it is enough to require that the left derivatives of $f$ and $g$ at 1 be equal to 0. Analogously,
if $\bar{\kappa} = \sup_{0 < t < 1} \kappa(t)$, it is enough to require that the right derivatives of $f$ and $g$ at 1 be equal to 0. When
neither left nor right derivatives at 1 are 0, then (88) need not hold as the following example shows.
Example 1: Let $f(t) = |t - 1|$ and

$$ g(t) = 2 f(t) + 1 - t. \tag{92} $$

Then, $\bar{\kappa} = 1$, while in view of (39) and (56), for all $(P, Q)$,

$$ D_g(P\|Q) = 2 D_f(P\|Q) = 2\, |P - Q|. \tag{93} $$

B. Relationships Among $D(P\|Q)$, $\chi^2(P\|Q)$ and $|P - Q|$

Since the Rényi divergence of order $\alpha > 0$ is monotonically increasing in $\alpha$, (81) yields

$$ D(P\|Q) \le \log\bigl(1 + \chi^2(P\|Q)\bigr) \tag{94} $$

$$ \le \chi^2(P\|Q) \log e. \tag{95} $$

Inequality (94), which can be found in [99] and Bdl Theorem 5], is sharpened in Theorem 14 under the
assumption of bounded relative information. In view of (80), an alternative way to sharpen (94) is

$$ D(P\|Q) \le \frac{1}{\alpha - 1} \log\bigl(1 + (\alpha - 1)\, \mathscr{H}_\alpha(P\|Q)\bigr) \tag{96} $$

for $\alpha \in (1,2)$, which is tight as $\alpha \downarrow 1$.

Relationships between the relative entropy, total variation distance and $\chi^2$ divergence are derived next.
Theorem 2: a) If $P \ll Q$ and $c_1, c_2 \ge 0$, then

$$ D(P\|Q) \le \bigl(c_1 |P - Q| + c_2\, \chi^2(P\|Q)\bigr) \log e \tag{97} $$

holds if $(c_1, c_2) = (0, 1)$ and $(c_1, c_2) = \bigl(\tfrac14, \tfrac12\bigr)$. Furthermore, if $c_1 = 0$ then $c_2 = 1$ is optimal, and if $c_2 = \tfrac12$
then $c_1 = \tfrac14$ is optimal.

b)

$$ \sup \frac{D(P\|Q) + D(Q\|P)}{\chi^2(P\|Q) + \chi^2(Q\|P)} = \tfrac12 \log e \tag{98} $$

where the supremum is over $P \ll\gg Q$ and $P \ne Q$.

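Proof:

a) The satisfiability of (97) with $(c_1, c_2) = (0, 1)$ is equivalent to (95).
Let $P \ll Q$, and $Y \sim Q$. Then,

$$ D(P\|Q) = \mathbb{E}\left[ r\left(\frac{\mathrm{d}P}{\mathrm{d}Q}(Y)\right) \right] \tag{99} $$

$$ \le \tfrac12\, \mathbb{E}\left[ \left(1 - \frac{\mathrm{d}P}{\mathrm{d}Q}(Y)\right)^+ + \left(\frac{\mathrm{d}P}{\mathrm{d}Q}(Y) - 1\right)^2 \right] \log e \tag{100} $$

$$ = \bigl(\tfrac14 |P - Q| + \tfrac12\, \chi^2(P\|Q)\bigr) \log e \tag{101} $$

where (99) follows from the definition of relative entropy with the function $r\colon (0,\infty) \to \mathbb{R}$ defined in (43);
(100) holds since for $t \in (0,\infty)$

$$ r(t) \le \tfrac12 \bigl[(1 - t)^+ + (t - 1)^2\bigr] \log e \tag{102} $$

and (101) follows from (47), (57), and the identity

$$ (1 - t)^+ = \tfrac12 \bigl[\, |1 - t| + (1 - t) \bigr]. $$

This proves (97) with $(c_1, c_2) = \bigl(\tfrac14, \tfrac12\bigr)$.

Next, we show that if $c_1 = 0$ then $c_2 = 1$ is the best possible constant in (97). To that end, let $f_2(t) = (t - 1)^2$,
and let $\kappa(t)$ be the continuous extension of $\frac{r(t)}{f_2(t)}$. It can be verified that the function $\kappa$ is monotonically
decreasing on $(0,\infty)$, so

$$ \bar{\kappa} = \lim_{t \downarrow 0} \kappa(t) = \log e. \tag{103} $$

Since $D_r(P\|Q) = D(P\|Q)$ and $D_{f_2}(P\|Q) = \chi^2(P\|Q)$, and $r'(1) = f_2'(1) = 0$, the desired result follows
from Theorem 1b).

To show that $c_1 = \tfrac14$ is the best possible constant in (97) if $c_2 = \tfrac12$, we let $g_2(t) = \tfrac12 \bigl[(1 - t)^+ + (t - 1)^2\bigr]$.
Theorem 1b) does not apply here since $g_2$ is not differentiable at $t = 1$. However, we can still construct
probability measures for proving the optimality of the point $(c_1, c_2) = \bigl(\tfrac14, \tfrac12\bigr)$. To that end, let $\varepsilon \in (0,1)$,
and define probability measures $P_\varepsilon$ and $Q_\varepsilon$ on the set $\mathcal{A} = \{0, 1\}$ with $P_\varepsilon(1) = \varepsilon^2$ and $Q_\varepsilon(1) = \varepsilon$. Since
$D_r(P\|Q) = D(P\|Q)$ and $D_{g_2}(P\|Q) = \tfrac14 |P - Q| + \tfrac12\, \chi^2(P\|Q)$,

$$ \lim_{\varepsilon \downarrow 0} \frac{D(P_\varepsilon\|Q_\varepsilon)}{\tfrac14 |P_\varepsilon - Q_\varepsilon| + \tfrac12\, \chi^2(P_\varepsilon\|Q_\varepsilon)} = \lim_{\varepsilon \downarrow 0} \frac{(1 - \varepsilon)\, r(1 + \varepsilon) + \varepsilon\, r(\varepsilon)}{(1 - \varepsilon)\, g_2(1 + \varepsilon) + \varepsilon\, g_2(\varepsilon)} \tag{104} $$

$$ = \log e \tag{105} $$

where (105) holds since we can write numerator and denominator in the right side as

$$ (1 - \varepsilon)\, r(1 + \varepsilon) + \varepsilon\, r(\varepsilon) = (1 - \varepsilon^2) \log(1 + \varepsilon) + \varepsilon^2 \log \varepsilon \tag{106} $$

$$ = \varepsilon \log e + o(\varepsilon), \tag{107} $$

$$ (1 - \varepsilon)\, g_2(1 + \varepsilon) + \varepsilon\, g_2(\varepsilon) = \varepsilon - \varepsilon^2. \tag{108} $$

b) We have

$$ D_f(P\|Q) = D(P\|Q) + D(Q\|P), \tag{109} $$

$$ D_g(P\|Q) = \chi^2(P\|Q) + \chi^2(Q\|P) \tag{110} $$

with

$$ f(t) = (t - 1) \log t \tag{111} $$

$$ g(t) = t^2 - t - 1 + \frac{1}{t} \tag{112} $$

$$ \kappa(t) = \frac{f(t)}{g(t)} = \frac{t \log t}{t^2 - 1}, \quad t \in (0,1) \cup (1,\infty) \tag{113} $$

$$ \lim_{t \to 1} \kappa(t) = \bar{\kappa} = \tfrac12 \log e \tag{114} $$

where (114) is easy to verify since $\kappa$ is monotonically increasing on $(0,1)$, and monotonically decreasing
on $(1,\infty)$. The desired result follows since the conditions of Theorem 1b) apply.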
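The two optimality arguments in the proof can be sanity-checked numerically. The sketch below, in nats (so $\log e = 1$), evaluates the ratio in (104) for a small $\varepsilon$, compares the denominator with the closed form (108), and scans the function in (113) on a grid; the function names and grid are illustrative assumptions.

```python
import math

def r(t):
    """r(t) = t log t + 1 - t in nats, see (43)."""
    return t * math.log(t) + 1 - t

def g2(t):
    """g_2(t) = (1/2)[(1 - t)^+ + (t - 1)^2] from the proof of part a)."""
    return 0.5 * (max(1 - t, 0.0) + (t - 1) ** 2)

def numerator(eps):
    return (1 - eps) * r(1 + eps) + eps * r(eps)

def denominator(eps):
    return (1 - eps) * g2(1 + eps) + eps * g2(eps)

def ratio_104(eps):
    """The ratio inside the limit (104) for P_eps(1) = eps^2, Q_eps(1) = eps."""
    return numerator(eps) / denominator(eps)

def kappa_sym(t):
    """kappa(t) = t log t / (t^2 - 1) from (113), in nats."""
    return t * math.log(t) / (t * t - 1)

# Grid over (0, 5] that skips the removable singularity at t = 1.
grid = [k / 100 for k in range(1, 501) if k != 100]
largest_kappa = max(kappa_sym(t) for t in grid)
```

The ratio approaches 1 slowly because of the $\varepsilon^2 \log \varepsilon$ term in (106), and the grid maximum of `kappa_sym` stays just below $\tfrac12$, the limit in (114).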


Remark 4: Inequality (97) strengthens the bound claimed in [32, (2.8)],

$$ D(P\|Q) \le \tfrac12 \bigl(|P - Q| + \chi^2(P\|Q)\bigr) \log e, \tag{115} $$

although the short outline of the suggested proof in [32, p. 710] leads to the weaker upper bound $|P - Q| +
\tfrac12\, \chi^2(P\|Q)$ nats.

Remark 5: Note that (95) implies a looser result where the constant in the right side of (98) is doubled.
Furthermore, (1) and (98) result in the bound $\chi^2(P\|Q) + \chi^2(Q\|P) \ge 2\, |P - Q|^2$ which, although weaker than
(15), has the same behavior for small values of $|P - Q|$.


C. Relationships Among $D(P\|Q)$, $\Delta(P\|Q)$ and $|P - Q|$

The next result shows three upper bounds on the Vincze-Le Cam distance in terms of the relative entropy
(and total variation distance).

Theorem 3: Let $r\colon (0,\infty) \to [0,\infty)$ be the function defined in (43), and let $f_\Delta$ denote the function $f\colon (0,\infty) \to \mathbb{R}$
defined in (60).

a) If $P \ll Q$, then

$$ \Delta(P\|Q) \log e \le c_1\, D(P\|Q) \tag{116} $$

where $c_1 = 1.11591 = \bar{\kappa}$, computed with

$$ \kappa(t) = \frac{f_\Delta(t) \log e}{r(t)}. \tag{117} $$

Furthermore, the constant $c_1$ in (116) is the best possible.

b) If $P \ll Q$, then

$$ \Delta(P\|Q) \log e \le D(P\|Q) + c_2\, |P - Q| \log e \tag{118} $$

where $c_2 = 0.0374250$ is defined by

$$ c_2 = \min\{c \ge 0\colon \kappa_c(t) \le 1,\ \forall\, t \in (0,1)\} \tag{119} $$

$$ \kappa_c(t) = \frac{f_\Delta(t) \log e}{r(t) + 2c\,(1 - t)^+ \log e}. \tag{120} $$

Furthermore, the constant $c_2$ in (118) is the best possible.

c) If $P \ll\gg Q$, then

$$ \Delta(P\|Q) \log e \le \tfrac12 D(P\|Q) + \tfrac12 D(Q\|P) \tag{121} $$

and the bound in (121) is tight.

Proof:

a) Let $f = f_\Delta \log e$ and $g = r$. These functions satisfy the conditions in Theorem 1b), and result in the
function $\kappa$ defined in (117). The function $\kappa$ is maximal at $t^* = 0.223379$ with $c_1 = \kappa(t^*) = 1.11591$. Since
$D_f(P\|Q) = \Delta(P\|Q) \log e$ and $D_g(P\|Q) = D(P\|Q)$, Theorem 1 results in (116) with the optimality of
its constant $c_1$.

b) By the definition of $c_2$ in (119), it follows that for all $t > 0$

$$ f_\Delta(t) \log e \le r(t) + 2 c_2\, (1 - t)^+ \log e; \tag{122} $$

note that (122) holds for all $t \in (0,\infty)$ (i.e., not only in $(0,1)$ as in (119)) since, for $t \ge 1$, (122) is equivalent
to $f_\Delta(t) \log e \le r(t)$ which indeed holds for all $t \in [1,\infty)$. Hence, the bound in (118) follows from
(122) and (68). It can be verified from (119) that $c_2 = 0.0374250$, the maximal value of $\kappa_{c_2}$ on $(0,\infty)$
is attained at $t^* = 0.122463$ and $\kappa_{c_2}(t^*) = 1$. The optimality of the constant $c_2$ in (118) is shown next (note
that Theorem 1b) cannot be used here, since the right side of (122) is not differentiable at $t = 1$). Let $P_\varepsilon$
and $Q_\varepsilon$ be probability measures defined on $\mathcal{A} = \{0, 1\}$ with $P_\varepsilon(0) = t^* \varepsilon$ and $Q_\varepsilon(0) = \varepsilon$ for $\varepsilon \in (0,1)$.
Then, it follows that

$$ \lim_{\varepsilon \downarrow 0} \frac{\Delta(P_\varepsilon\|Q_\varepsilon) \log e}{D(P_\varepsilon\|Q_\varepsilon) + c_2\, |P_\varepsilon - Q_\varepsilon| \log e} = \lim_{\varepsilon \downarrow 0} \frac{f_\Delta(t^*)\, \varepsilon \log e + o(\varepsilon)}{\bigl[r(t^*) + 2 c_2 (1 - t^*) \log e\bigr]\, \varepsilon + o(\varepsilon)} \tag{123} $$

$$ = \kappa_{c_2}(t^*) = 1 \tag{124} $$

where (123) follows by the construction of $P_\varepsilon$ and $Q_\varepsilon$, and the use of Taylor series expansions; (124) follows
from (120) (note that $t^* < 1$), which yields the required optimality result in (118).

c) Let $f = f_\Delta \log e$ and $g\colon (0,\infty) \to \mathbb{R}$ be defined by $g(t) = (t - 1) \log t$ for all $t > 0$. These functions,
which satisfy the conditions in Theorem 1b), yield $D_f(P\|Q) = \Delta(P\|Q) \log e$ and $D_g(P\|Q) = D(P\|Q) +
D(Q\|P)$. Theorem 1b) yields the desired result with

$$ \kappa(t) = \frac{t - 1}{(t + 1) \log_e t}, \quad t \in (0,1) \cup (1,\infty) \tag{125} $$

$$ \kappa(1) = \tfrac12 = \bar{\kappa}. \tag{126} $$

Remark 6: We can generalize (116) and (121) to

\[ \Delta(P\|Q) \log e \le c_1 D(P\|Q) + c_2 D(Q\|P) \tag{127} \]

where $c_1, c_2 \ge 0$ and $P \ll\gg Q$. In view of Theorem 1, the locus of the allowable constant pairs $(c_1, c_2)$ in (127) can be evaluated. To that end, Theorem 1 can be used with $f(t) = f_\Delta(t) \log e$, and $g(t) = c_1 r(t) + c_2\, t\, r(1/t)$ for all $t > 0$. This results in a convex region, which is symmetric with respect to the straight line $c_1 = c_2$ (since $\Delta(P\|Q) = \Delta(Q\|P)$). Note that (116), (121) and the symmetry property of this region identify, respectively, the points $(1.11591, 0)$, $(\tfrac12, \tfrac12)$ and $(0, 1.11591)$ on its boundary; since the sum of their coordinates is nearly constant, the boundary of this subset of the positive quadrant is nearly a straight line of slope $-1$.
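The constants quoted in this subsection can be checked numerically. The following is a minimal sketch, not the authors' procedure: it assumes $f_\Delta(t) = (t-1)^2/(t+1)$, $r(t) = t \ln t + 1 - t$ (nats), and that $\kappa_{c_2}$ has the form suggested by (122)-(123); the helper `sup_on_grid` is a hypothetical crude grid search.

```python
import math

def f_delta(t):
    """Generator of the triangular discrimination, (t - 1)^2 / (t + 1)."""
    return (t - 1.0) ** 2 / (t + 1.0)

def r(t):
    """Generator of the relative entropy in nats, t ln t + 1 - t."""
    return t * math.log(t) + 1.0 - t

def sup_on_grid(kappa, lo=1e-6, hi=50.0, n=200_000):
    """Crude grid maximum of kappa, skipping a small window around t = 1 (0/0)."""
    best = -math.inf
    for i in range(n + 1):
        t = lo + (hi - lo) * i / n
        if abs(t - 1.0) > 1e-3:
            best = max(best, kappa(t))
    return best

C2 = 0.0374250  # constant quoted in the text

def kappa_c2(t):
    """Assumed form of the ratio behind (122): its supremum should be 1."""
    return f_delta(t) / (r(t) + 2.0 * C2 * max(1.0 - t, 0.0))

# Best c1 in (116), taken here as sup_t f_delta(t) / r(t).
c1_numeric = sup_on_grid(lambda t: f_delta(t) / r(t))
sup_c2 = sup_on_grid(kappa_c2)
# Ratio (125) for the symmetric bound, with g(t) = (t - 1) ln t.
sup_sym = sup_on_grid(lambda t: f_delta(t) / ((t - 1.0) * math.log(t)))

print(c1_numeric, sup_c2, sup_sym)
```

Under these assumptions the three printed values should be close to 1.11591, 1 and 1/2, matching the boundary points named in Remark 6.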


D. An Alternative Proof of Samson's Inequality 

An analog of Pinsker's inequality, which comes in handy for the proof of Marton's conditional transportation inequality [13, Lemma 8.4], is the following bound due to Samson [89, Lemma 2]:


December 6, 2016 DRAFT

SASON AND VERDU: f-DIVERGENCE INEQUALITIES


Theorem 4: If $P \ll Q$, then

\[ d_2^2(P,Q) + d_2^2(Q,P) \le \frac{2}{\log e}\, D(P\|Q) \tag{128} \]

where $d_2(P,Q)$ is the distance measure defined in (72).

We provide next an alternative proof of Theorem 4, in view of Theorem 1, with the following advantages:
a) This proof yields the optimality of the constant in (128), i.e., we prove that

\[ \sup \frac{d_2^2(P,Q) + d_2^2(Q,P)}{D(P\|Q)} = \frac{2}{\log e} \tag{129} \]

where the supremum is over all probability measures $P, Q$ such that $P \ne Q$ and $P \ll\gg Q$.
b) A simple adaptation of this proof makes it possible to derive a reverse inequality to (128), which holds under the boundedness assumption of the relative information (see Section IV-D).

Proof:

\begin{align}
d_2^2(P,Q) + d_2^2(Q,P) &= D_s(P\|Q) + D_{s^*}(P\|Q) \tag{130} \\
&= D_f(P\|Q) \tag{131}
\end{align}

where, from (74), $s^*\colon (0, \infty) \to [0, \infty)$ is the convex function

\[ s^*(t) = t\, s\!\left(\tfrac1t\right) = \frac{(t-1)^2\, 1\{t > 1\}}{t} \tag{132} \]

and, from (74) and (132), the non-negative and convex function $f\colon (0, \infty) \to \mathbb{R}$ in (131) is given by

\[ f(t) = s(t) + s^*(t) = \frac{(t-1)^2}{\max\{1, t\}} \tag{133} \]

for all $t > 0$. Let $r$ be the non-negative and convex function $r\colon (0, \infty) \to \mathbb{R}$ defined in (43), which yields $D_r(P\|Q) = D(P\|Q)$. Note that $f(1) = r(1) = 0$, and the functions $f$ and $r$ are both differentiable at $t = 1$ with $f'(1) = r'(1) = 0$. The desired result follows from Theorem 1 since in this case

\begin{align}
\kappa(t) &= \frac{(t-1)^2}{r(t) \max\{1, t\}}, \quad t \in (0,1) \cup (1, \infty) \tag{134} \\
\bar{\kappa} &= \frac{2}{\log e} \tag{135}
\end{align}

as can be verified from the monotonicity of $\kappa$ on $(0,1)$ (increasing) and $(1,\infty)$ (decreasing). ■
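The monotonicity claim behind (134)-(135) can be spot-checked numerically. This is only an illustrative sketch in nats (so that $2/\log e = 2$), with $r(t) = t \ln t + 1 - t$ assumed as in (43); a few sample points are evaluated, which is not a proof.

```python
import math

def r(t):
    """Generator of relative entropy in nats: t ln t + 1 - t."""
    return t * math.log(t) + 1.0 - t

def kappa(t):
    """Ratio (134): (t - 1)^2 / (r(t) * max{1, t})."""
    return (t - 1.0) ** 2 / (r(t) * max(1.0, t))

left = [kappa(t) for t in (0.01, 0.1, 0.5, 0.9, 0.999)]     # should increase toward 2
right = [kappa(t) for t in (1.001, 1.1, 2.0, 10.0, 100.0)]  # should decrease from 2
print(left, right)
```

Both sequences approach 2 near $t = 1$ and stay strictly below it, consistent with the supremum in (129) being a limit rather than a maximum.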

Remark 7: As mentioned in [89, p. 438], Samson's inequality (128) strengthens the Pinsker-type inequality in [73, Lemma 3.2]:

\[ d_2^2(P,Q) \le \frac{2}{\log e} \min\{D(P\|Q), D(Q\|P)\}. \tag{136} \]

Nevertheless, similarly to our alternative proof of Theorem 4, one can verify that Theorem 1 yields the optimality of the constant in (136).

E. Ratio of f-Divergence to Total Variation Distance

Vajda [Theorem 2 of the cited work] showed that the range of an $f$-divergence is given by (see (35))

\[ 0 \le D_f(P\|Q) \le f(0) + f^*(0) \tag{137} \]

where every value in this range is attainable by a suitable pair of probability measures $P \ll Q$. Recalling Remark 1, note that $f_b(0) + f_b^*(0) = f(0) + f^*(0)$ with $f_b(\cdot)$ defined in (38). Basu et al. [9, Lemma 11.1] strengthened (137), showing that

\[ D_f(P\|Q) \le \tfrac12 \bigl(f(0) + f^*(0)\bigr)\, |P - Q|. \tag{138} \]

Note that, provided $f(0)$ and $f^*(0)$ are finite, (138) yields a counterpart to (24). Next, we show that the constant in (138) cannot be improved.

Theorem 5: If $f\colon (0, \infty) \to \mathbb{R}$ is convex with $f(1) = 0$, then

\[ \sup \frac{D_f(P\|Q)}{|P - Q|} = \tfrac12 \bigl(f(0) + f^*(0)\bigr) \tag{139} \]

where the supremum is over all probability measures $P, Q$ such that $P \ll Q$ and $P \ne Q$.

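Proof: As the first step, we give a simplified proof of (138) (cf. [9, pp. 344-345]). In view of Remark 1, it is sufficient to show that for all $t > 0$,

\[ f(t) + \tfrac12 \bigl(f(0) - f^*(0)\bigr)(t - 1) \le \tfrac12 \bigl(f(0) + f^*(0)\bigr)\, |t - 1|. \tag{140} \]

If $t \in (0,1)$, (140) reduces to $f(t) \le (1 - t) f(0)$, which holds in view of the convexity of $f$ and $f(1) = 0$. If $t \ge 1$, we can readily check, with the aid of (34), that (140) reduces to $f^*(\tfrac1t) \le (1 - \tfrac1t) f^*(0)$, which, in turn, holds because $f^*$ is convex and $f^*(1) = 0$.

For the second part of the proof of (139), we construct a pair of probability measures $P_\varepsilon$ and $Q_\varepsilon$ such that, for a sufficiently small $\varepsilon > 0$, $\frac{D_f(P_\varepsilon\|Q_\varepsilon)}{|P_\varepsilon - Q_\varepsilon|}$ can be made arbitrarily close to the right side of (139). To that end, let $\varepsilon \in \bigl(0, \tfrac12(\sqrt5 - 1)\bigr]$, and let $P_\varepsilon$ and $Q_\varepsilon$ be defined on the set $\mathcal{A} = \{0, 1, 2\}$ with $P_\varepsilon(0) = Q_\varepsilon(1) = \varepsilon$ and $P_\varepsilon(1) = Q_\varepsilon(0) = \varepsilon^2$. Then,

\begin{align}
\lim_{\varepsilon \to 0} \frac{D_f(P_\varepsilon\|Q_\varepsilon)}{|P_\varepsilon - Q_\varepsilon|}
&= \lim_{\varepsilon \to 0} \frac{\varepsilon f(\varepsilon) + \varepsilon^2 f\!\left(\tfrac1\varepsilon\right)}{2 \varepsilon (1 - \varepsilon)} \tag{141} \\
&= \tfrac12 f(0) + \tfrac12 f^*(0) \tag{142}
\end{align}

where (141) holds since $P_\varepsilon(2) = Q_\varepsilon(2)$, and (142) follows from (33) and (35).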
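The three-point construction in the proof can be evaluated directly. The sketch below is illustrative only: it assumes full-support finite distributions, uses $f_\Delta(t) = (t-1)^2/(t+1)$ (for which $\tfrac12(f(0)+f^*(0)) = 1$) and $s(t) = (1-t)^2\, 1\{t<1\}$ (for which the limit is $\tfrac12$); the function name `ratio` is hypothetical.

```python
def f_delta(t):
    """Triangular-discrimination generator; f(0) = f*(0) = 1."""
    return (t - 1.0) ** 2 / (t + 1.0)

def s(t):
    """Generator of d_2^2(P, Q); s(0) = 1 and s*(0) = 0."""
    return (1.0 - t) ** 2 if t < 1.0 else 0.0

def ratio(eps, f):
    """D_f(P_eps || Q_eps) / |P_eps - Q_eps| for the construction in the proof."""
    rest = 1.0 - eps - eps ** 2
    p = [eps, eps ** 2, rest]
    q = [eps ** 2, eps, rest]
    d_f = sum(qi * f(pi / qi) for pi, qi in zip(p, q))
    tv = sum(abs(pi - qi) for pi, qi in zip(p, q))
    return d_f / tv

print(ratio(1e-3, f_delta), ratio(1e-3, s))
```

For small `eps` the two ratios approach 1 and 1/2 from below, in line with (145) and (147).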
Remark 8: Csiszár [24, Theorem 2] showed that if $f(0)$ and $f^*(0)$ are finite and $P \ll Q$, then there exists a constant $C_f > 0$ which depends only on $f$ such that

\[ D_f(P\|Q) \le C_f \sqrt{|P - Q|}. \tag{143} \]

If $|P - Q| \le 1$, then (143) is superseded by (138) where the constant is not only explicit but is the best possible according to Theorem 5.

A direct application of Theorem 5 yields
Corollary 1:

\begin{align}
\sup_{P \ne Q} \frac{\mathscr{H}_\alpha(P\|Q)}{|P - Q|} &= \frac{1}{2(1 - \alpha)}, \quad \forall\, \alpha \in (0,1) \tag{144} \\
\sup_{P \ne Q} \frac{\Delta(P\|Q)}{|P - Q|} &= 1, \tag{145} \\
\sup_{P \ne Q} \frac{JS(P\|Q)}{|P - Q|} &= \log 2, \tag{146} \\
\sup_{P \ne Q} \frac{d_2^2(P,Q)}{|P - Q|} &= \tfrac12, \tag{147} \\
\sup_{P \ne Q} \frac{d_2^2(P,Q) + d_2^2(Q,P)}{|P - Q|} &= 1 \tag{148}
\end{align}

where the suprema in (144)-(147) are over all $P \ll Q$ with $P \ne Q$, and the supremum in (148) is over all $P \ll\gg Q$ with $P \ne Q$.

Remark 9: The results in (144), (145) and (146) strengthen, respectively, the inequalities in [65, Proposition 2.35], [101, (11)] and [101, Theorem 2]. The results in (147) and (148) form counterparts of (129).


IV. Bounded Relative Information 

In this section we show that it is possible to find bounds among $f$-divergences without requiring a strong condition of functional domination (see Section III) as long as the relative information is upper and/or lower bounded almost surely.

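A. Definition of $\beta_1$ and $\beta_2$.

The following notation is used throughout the rest of the paper. Given a pair of probability measures $(P, Q)$ on the same measurable space, denote $\beta_1, \beta_2 \in [0,1]$ by

\begin{align}
\beta_1 &= \exp\bigl(-D_\infty(P\|Q)\bigr), \tag{149} \\
\beta_2 &= \exp\bigl(-D_\infty(Q\|P)\bigr) \tag{150}
\end{align}

with the convention that if $D_\infty(P\|Q) = \infty$, then $\beta_1 = 0$, and if $D_\infty(Q\|P) = \infty$, then $\beta_2 = 0$. Note that if $\beta_1 > 0$, then $P \ll Q$, while $\beta_2 > 0$ implies $Q \ll P$. Furthermore, if $P \ll\gg Q$, then with $Y \sim Q$,

\begin{align}
\beta_1 &= \operatorname{ess\,inf} \frac{\mathrm{d}Q}{\mathrm{d}P}(Y) = \left(\operatorname{ess\,sup} \frac{\mathrm{d}P}{\mathrm{d}Q}(Y)\right)^{-1}, \tag{151} \\
\beta_2 &= \operatorname{ess\,inf} \frac{\mathrm{d}P}{\mathrm{d}Q}(Y) = \left(\operatorname{ess\,sup} \frac{\mathrm{d}Q}{\mathrm{d}P}(Y)\right)^{-1}. \tag{152}
\end{align}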
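For finite alphabets with full support, (151)-(152) reduce to the extreme values of the likelihood ratio. The sketch below is a minimal illustration under that assumption; `beta_pair` is a hypothetical helper name, and the case of zero-probability atoms (where $\beta_1$ or $\beta_2$ is 0) is deliberately left out.

```python
def beta_pair(p, q):
    """Return (beta_1, beta_2) for strictly positive pmfs p and q.

    beta_1 = 1 / max_a p(a)/q(a)   (inverse of the largest ratio, cf. (151))
    beta_2 = min_a p(a)/q(a)       (smallest ratio, cf. (152))
    """
    ratios = [pi / qi for pi, qi in zip(p, q)]
    return 1.0 / max(ratios), min(ratios)

# Likelihood ratios are 1/2 and 2, so both parameters equal 1/2.
b1, b2 = beta_pair([1 / 3, 2 / 3], [2 / 3, 1 / 3])
print(b1, b2)
```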


The following examples illustrate important cases in which 0\ and 02 are positive. 

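Example 2: (Gaussian distributions.) Let $P$ and $Q$ be Gaussian probability measures with equal means, and variances $\sigma_0^2$ and $\sigma_1^2$ respectively. Then,

\begin{align}
\beta_1 &= \frac{\sigma_0}{\sigma_1}\, 1\{\sigma_0 \le \sigma_1\}, \tag{153} \\
\beta_2 &= \frac{\sigma_1}{\sigma_0}\, 1\{\sigma_1 \le \sigma_0\}. \tag{154}
\end{align}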

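Example 3: (Shifted Laplace distributions.) Let $P$ and $Q$ be the probability measures whose probability density functions are, respectively, given by $f_\lambda(\cdot - a_0)$ and $f_\lambda(\cdot - a_1)$ with

\[ f_\lambda(x) = \frac{\lambda}{2} \exp(-\lambda |x|), \quad x \in \mathbb{R} \tag{155} \]

where $\lambda > 0$. In this case, (155) gives

\[ \frac{\mathrm{d}P}{\mathrm{d}Q}(x) = \exp\bigl(\lambda (|x - a_1| - |x - a_0|)\bigr), \quad x \in \mathbb{R} \tag{156} \]

which yields

\[ \beta_1 = \beta_2 = \exp(-\lambda |a_1 - a_0|) \in (0, 1]. \tag{157} \]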


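Example 4: (Cramér distributions.) Suppose that $P$ and $Q$ have Cramér probability density functions $f_{\theta_1, m_1}$ and $f_{\theta_0, m_0}$, respectively, with

\[ f_{\theta, m}(x) = \frac{\theta}{2 (1 + \theta |x - m|)^2}, \quad x \in \mathbb{R} \tag{158} \]

where $\theta > 0$ and $m \in \mathbb{R}$. In this case, we have $\beta_1, \beta_2 \in (0,1)$ since the ratio of the probability density functions, $\frac{f_{\theta_1, m_1}}{f_{\theta_0, m_0}}$, tends to $\frac{\theta_0}{\theta_1} < \infty$ in the limit $x \to \pm\infty$. In the special case where $m_0 = m_1 = m \in \mathbb{R}$, the ratio of these probability density functions is $\frac{\theta_1}{\theta_0}$ at $x = m$; due also to the symmetry of the probability density functions around $m$, it can be verified that in this special case

\[ \beta_1 = \beta_2 = \min\left\{\frac{\theta_0}{\theta_1}, \frac{\theta_1}{\theta_0}\right\}. \tag{159} \]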
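Example 5: (Cauchy distributions.) Suppose that $P$ and $Q$ have Cauchy probability density functions $g_{\gamma_1, m_1}$ and $g_{\gamma_0, m_0}$, respectively, $\gamma_0 \ne \gamma_1$ and

\[ g_{\gamma, m}(x) = \frac{1}{\pi \gamma} \left[1 + \left(\frac{x - m}{\gamma}\right)^2\right]^{-1}, \quad x \in \mathbb{R} \tag{160} \]

where $\gamma > 0$. In this case, we also have $\beta_1, \beta_2 \in (0,1)$ since the ratio of the probability density functions tends to $\frac{\gamma_1}{\gamma_0} < \infty$ in the limit $x \to \pm\infty$. In the special case where $m_0 = m_1$,

\[ \beta_1 = \beta_2 = \min\left\{\frac{\gamma_1}{\gamma_0}, \frac{\gamma_0}{\gamma_1}\right\}. \tag{161} \]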


B. Basic Tool 


Since $\beta_1 = 1 \Leftrightarrow \beta_2 = 1 \Leftrightarrow P = Q$, it is advisable to avoid trivialities by excluding that case.

Theorem 6: Let $f$ and $g$ satisfy the assumptions in Theorem 1 and assume that $(\beta_1, \beta_2) \in [0,1)^2$. Then,

\[ D_f(P\|Q) \le \kappa^* D_g(P\|Q) \tag{162} \]

where

\[ \kappa^* = \sup_{\beta \in (\beta_2, 1) \cup (1, \beta_1^{-1})} \kappa(\beta) \tag{163} \]

and $\kappa(\cdot)$ is defined in (85).

Proof: Defining $g(0)$ and $g^*(0)$ as in (33)-(35), respectively, note that

\[ f(0)\, Q(p = 0) \le g(0)\, \kappa^*\, Q(p = 0) \tag{164} \]

because if $\beta_2 = 0$ then $f(0) \le \kappa^* g(0)$, and if $\beta_2 > 0$ then $Q(p = 0) = 0$. Similarly,

\[ f^*(0)\, P(q = 0) \le g^*(0)\, \kappa^*\, P(q = 0) \tag{165} \]

because if $\beta_1 = 0$ then $f^*(0) \le \kappa^* g^*(0)$ and if $\beta_1 > 0$ then $P(q = 0) = 0$. Moreover, since $f(1) = g(1) = 0$, we can substitute $\{pq > 0\}$ by $\{pq > 0\} \cap \{p \ne q\}$ in the right side of (37) for $D_f(P\|Q)$ and likewise for $D_g(P\|Q)$. In view of the definition of $\kappa^*$, (164) and (165),

\begin{align}
D_f(P\|Q) &\le \kappa^* \int_{\{pq > 0,\, p \ne q\}} q\, g\!\left(\frac{p}{q}\right) \mathrm{d}\mu + \kappa^* g(0)\, Q(p = 0) + \kappa^* g^*(0)\, P(q = 0) \tag{166} \\
&= \kappa^* D_g(P\|Q). \tag{167}
\end{align}
Note that if $\beta_1 = \beta_2 = 0$, then Theorem 6 does not improve upon Theorem 1.

Remark 10: In the application of Theorem 6, it is often convenient to make use of the freedom afforded by Remark 1 and choose the corresponding offsets such that:

• the positivity property of $g$ required by Theorem 6 is satisfied;

• the lowest $\kappa^*$ is obtained.

Remark 11: Similarly to the proof of Theorem 1, under the conditions therein, one can verify that the constants in Theorem 6 are the best possible among all probability measures $P, Q$ with given $(\beta_1, \beta_2) \in [0,1)^2$.
Remark 12: Note that if we swap the assumptions on $f$ and $g$ in Theorem 6, the same result translates into

\[ \inf_{\beta \in (\beta_2, 1) \cup (1, \beta_1^{-1})} \kappa(\beta) \cdot D_g(P\|Q) \le D_f(P\|Q). \tag{168} \]

Furthermore, provided both $f$ and $g$ are positive (except at $t = 1$) and $\kappa$ is monotonically increasing, Theorem 6 and (168) result in

\begin{align}
\kappa(\beta_2)\, D_g(P\|Q) &\le D_f(P\|Q) \tag{169} \\
&\le \kappa(\beta_1^{-1})\, D_g(P\|Q). \tag{170}
\end{align}

In this case, if $\beta_1 > 0$, sometimes it is convenient to replace $\beta_1 > 0$ with $\beta_1' \in (0, \beta_1)$ at the expense of loosening the bound. A similar observation applies to $\beta_2$.

Example 6: If $f(t) = (t - 1)^2$ and $g(t) = |t - 1|$, we get

\[ \chi^2(P\|Q) \le \max\{\beta_1^{-1} - 1,\, 1 - \beta_2\}\, |P - Q|. \tag{171} \]


C. Bounds on $\frac{D(P\|Q)}{D(Q\|P)}$

The remaining part of this section is devoted to various applications of Theorem 6. From this point, we make use of the definition of $r\colon (0,\infty) \to [0,\infty)$ in (43).

An illustrative application of Theorem 6 gives upper and lower bounds on the ratio of relative entropies.

Theorem 7: Let $P \ll\gg Q$, $P \ne Q$, and $(\beta_1, \beta_2) \in (0,1)^2$. Let $\kappa\colon (0,1) \cup (1,\infty) \to (0, \infty)$ be defined as

\[ \kappa(t) = \frac{t \log t + (1 - t) \log e}{(t - 1) \log e - \log t}. \tag{172} \]

Then,

\[ \kappa(\beta_2) \le \frac{D(P\|Q)}{D(Q\|P)} \le \kappa(\beta_1^{-1}). \tag{173} \]

Proof: For $t > 0$, let

\[ g(t) = -\log t + (t - 1) \log e; \tag{174} \]

then $D_g(P\|Q) = D(Q\|P)$, $D_r(P\|Q) = D(P\|Q)$ and the conditions of Theorem 6 are satisfied. The desired result follows from Theorem 6 and the monotonicity of $\kappa(\cdot)$ shown in Appendix A. ■
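A randomized sanity check of (173) on small alphabets can be written as follows. It is a sketch under stated assumptions (natural logarithms, strictly positive pmfs drawn by a hypothetical helper `random_pmf`); passing it does not replace the monotonicity argument of Appendix A.

```python
import math
import random

def kl(p, q):
    """Relative entropy D(p || q) in nats for strictly positive pmfs."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

def kappa(t):
    """Function (172) in nats: r(t) / (t - 1 - ln t)."""
    return (t * math.log(t) + 1.0 - t) / (t - 1.0 - math.log(t))

def random_pmf(n, rng):
    """Hypothetical helper: normalize n weights drawn from [0.05, 1]."""
    w = [rng.uniform(0.05, 1.0) for _ in range(n)]
    total = sum(w)
    return [x / total for x in w]

def check_theorem7(trials=2000, seed=0):
    """Return True if (173) held (up to rounding) on every random pair."""
    rng = random.Random(seed)
    for _ in range(trials):
        p, q = random_pmf(4, rng), random_pmf(4, rng)
        ratios = [pi / qi for pi, qi in zip(p, q)]
        beta1, beta2 = 1.0 / max(ratios), min(ratios)
        ratio = kl(p, q) / kl(q, p)
        lower, upper = kappa(beta2), kappa(1.0 / beta1)
        if not (lower * (1 - 1e-9) <= ratio <= upper * (1 + 1e-9)):
            return False
    return True

print(check_theorem7())
```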


D. Reverse Samson’s Inequality 

The next result gives a counterpart to Samson's inequality (128).

Theorem 8: Let $(\beta_1, \beta_2) \in (0,1)^2$. Then,

\[ \inf \frac{d_2^2(P,Q) + d_2^2(Q,P)}{D(P\|Q)} = \min\bigl\{\kappa(\beta_1^{-1}),\, \kappa(\beta_2)\bigr\} \tag{175} \]

where the infimum is over all $P \ll\gg Q$ with given $(\beta_1, \beta_2)$, and where $\kappa\colon (0,1) \cup (1,\infty) \to \bigl(0, \tfrac{2}{\log e}\bigr)$ is given in (134).

Proof: Applying Remark 12 to the convex and positive (except at $t = 1$) function $f(t)$ given in (133), and $g(t) = r(t)$, the lower bound on $\frac{d_2^2(P,Q) + d_2^2(Q,P)}{D(P\|Q)}$ in the right side of (175) follows from the fact that (134) is monotonically increasing on $(0,1)$, and monotonically decreasing on $(1,\infty)$. To verify that this is the best possible lower bound, we recall Remark 11 since in this case $f'(1) = g'(1) = 0$.


E. Bounds on $\frac{D(P\|Q)}{\mathscr{H}_\alpha(P\|Q)}$

The following result bounds the ratio of relative entropy to Hellinger divergence of an arbitrary positive order $\alpha \ne 1$. Theorem 9 extends and strengthens a result for $\alpha \in (0,1)$ by Haussler and Opper [Lemma 4 and (6)] (see also (16)), which in turn generalizes the special case for $\alpha = \tfrac12$ obtained simultaneously and independently by Birgé and Massart [(7.6)].

Theorem 9: Let $P \ll Q$, $P \ne Q$, $\alpha \in (0,1) \cup (1,\infty)$ and $(\beta_1, \beta_2) \in [0,1)^2$. Define the continuous function on $[0, \infty]$:

\[ \kappa_\alpha(t) = \begin{cases}
\log e & t = 0; \\
\dfrac{(1 - \alpha)\, r(t)}{1 - t^\alpha + \alpha t - \alpha} & t \in (0,1) \cup (1, \infty); \\
\alpha^{-1} \log e & t = 1; \\
\infty & t = \infty \text{ and } \alpha \in (0,1); \\
0 & t = \infty \text{ and } \alpha \in (1, \infty).
\end{cases} \tag{176} \]

Then, for $\alpha \in (0,1)$,

\[ \kappa_\alpha(\beta_2) \le \frac{D(P\|Q)}{\mathscr{H}_\alpha(P\|Q)} \le \kappa_\alpha(\beta_1^{-1}) \tag{177} \]

and, for $\alpha \in (1, \infty)$,

\[ \kappa_\alpha(\beta_1^{-1}) \le \frac{D(P\|Q)}{\mathscr{H}_\alpha(P\|Q)} \le \kappa_\alpha(\beta_2). \tag{178} \]

Proof: $D_r(P\|Q) = D(P\|Q)$, and $D_{g_\alpha}(P\|Q) = \mathscr{H}_\alpha(P\|Q)$ with

\[ g_\alpha(t) = \frac{1 - t^\alpha + \alpha t - \alpha}{1 - \alpha}, \quad t \in (0, \infty) \tag{179} \]

in view of Remark 1 and (52). Since $g_\alpha'(t) = \frac{\alpha (1 - t^{\alpha - 1})}{1 - \alpha}$, $g_\alpha$ is monotonically decreasing on $(0,1]$ and monotonically increasing on $[1,\infty)$; hence, $g_\alpha(1) = 0$ implies that $g_\alpha$ is positive except at $t = 1$. The convexity conditions required by Theorem 6 are also easy to check for both $r(\cdot)$ and $g_\alpha(\cdot)$. The function in (176) is the continuous extension of $\frac{r}{g_\alpha}$, which as shown in Appendix B is monotonically increasing on $[0, \infty]$ if $\alpha \in (0,1)$, and it is monotonically decreasing on $[0, \infty]$ if $\alpha \in (1,\infty)$. Therefore, Theorem 6 results in (177) and (178) for $\alpha \in (0,1)$ and $\alpha \in (1, \infty)$, respectively. ■

Remark 13: Theorem 9 is of particular interest for $\alpha = \tfrac12$. In this case since, from (176), $\kappa_{1/2}\colon [0, \infty] \to [0, \infty]$ is monotonically increasing with $\kappa_{1/2}(0) = \log e$, the left inequality in (177) yields (16).

For large arguments, $\kappa_{1/2}(\cdot)$ grows logarithmically. For example, if $\beta_1 > 9.56 \cdot 10^{-9}$, it follows from (177) that

\[ D(P\|Q) \le \left(14 + \frac{2 \log_e 2}{(1 - e^{-1})^2}\right) \mathscr{H}_{1/2}(P\|Q) \ \text{nats} \tag{180} \]

which is strictly smaller than the upper bound on the relative entropy in [111, Theorem 5], given not in terms of $\beta_1$ but in terms of another more cumbersome quantity that controls the mass that $\frac{\mathrm{d}P}{\mathrm{d}Q}$ may have at large values.

As mentioned in Section II-B, $\chi^2(P\|Q)$ is equal to the Hellinger divergence of order 2. Specializing Theorem 9 to the case $\alpha = 2$ results in

\[ \kappa_2(\beta_1^{-1}) \le \frac{D(P\|Q)}{\chi^2(P\|Q)} \le \kappa_2(\beta_2) \tag{181} \]

which improves the upper and lower bounds in [34, Proposition 2]:

\[ \tfrac12 \beta_1 \log e \le \frac{D(P\|Q)}{\chi^2(P\|Q)} \le \tfrac12 \beta_2^{-1} \log e. \tag{182} \]

For example, if $\beta_1 = \beta_2 = \tfrac{1}{100}$, (181) gives a possible range $[0.037, 0.9631]$ nats for the ratio of relative entropy to $\chi^2$-divergence, while (182) gives a range of $[0.005, 50]$ nats. Note that if $\beta_2 = 0$, then the upper bound in (181) is $\kappa_2(0) = \log e$ whereas it is $\infty$ according to [34, Proposition 2]. In view of Remark 11, the bounds in (181) are the best possible among all probability measures $P, Q$ with given $(\beta_1, \beta_2) \in [0,1)^2$.
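The numerical values quoted around (180)-(182) can be reproduced from (176). The sketch below works in nats and assumes $r(t) = t \ln t + 1 - t$; the constant in (180) is taken as $14 + 2 \ln 2 / (1 - e^{-1})^2$, which is the reading consistent with the threshold $9.56 \cdot 10^{-9}$.

```python
import math

def r(t):
    """Generator of relative entropy in nats."""
    return t * math.log(t) + 1.0 - t

def kappa(alpha, t):
    """Function (176) on (0,1) U (1,inf): (1 - alpha) r(t) / (1 - t^alpha + alpha t - alpha)."""
    return (1.0 - alpha) * r(t) / (1.0 - t ** alpha + alpha * t - alpha)

# Range in (181) for beta_1 = beta_2 = 1/100.
lower_181 = kappa(2.0, 100.0)
upper_181 = kappa(2.0, 0.01)

# Factor in (177) for alpha = 1/2 at the threshold beta_1 = 9.56e-9,
# compared with the constant in (180).
k_half = kappa(0.5, 1.0 / 9.56e-9)
const_180 = 14.0 + 2.0 * math.log(2.0) / (1.0 - math.exp(-1.0)) ** 2

print(lower_181, upper_181, k_half, const_180)
```

The first two values should round to 0.037 and 0.9631, and `k_half` should sit just below `const_180`.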
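F. Bounds on $\frac{JS(P\|Q)}{\Delta(P\|Q)}$, $\frac{\mathscr{H}_\alpha(P\|Q)}{JS(P\|Q)}$, $\frac{\Delta(P\|Q)}{|P - Q|}$

Let $P \ll Q$ and $P \ne Q$. [101, Theorem 2] and [101, (11)] show, respectively,

\begin{align}
\tfrac12 \Delta(P\|Q) \log e &\le JS(P\|Q) \le \Delta(P\|Q) \log 2, \tag{183} \\
\tfrac12 |P - Q|^2 &\le \Delta(P\|Q) \le |P - Q|. \tag{184}
\end{align}

The following result suggests a refinement in the upper bounds of (183) and (184).
Theorem 10: Let $P \ll Q$, $P \ne Q$, and let $\beta_1, \beta_2 \in [0,1)$. Then,

\begin{align}
\tfrac12 \log e \le \frac{JS(P\|Q)}{\Delta(P\|Q)} &\le \max\bigl\{\kappa_1(\beta_1^{-1}),\, \kappa_1(\beta_2)\bigr\}, \tag{185} \\
\frac{\Delta(P\|Q)}{|P - Q|} &\le \max\bigl\{\kappa_2(\beta_1^{-1}),\, \kappa_2(\beta_2)\bigr\} \tag{186}
\end{align}

with $\kappa_1\colon [0,\infty] \to [0, \log 2]$ and $\kappa_2\colon [0, \infty] \to [0,1]$ defined by

\[ \kappa_1(t) = \begin{cases}
\log 2 & t = 0; \\
\dfrac{t + 1}{(t - 1)^2} \left[t \log t - (t + 1) \log\!\left(\dfrac{t + 1}{2}\right)\right] & t \in (0,1) \cup (1,\infty); \\
\tfrac12 \log e & t = 1; \\
\log 2 & t = \infty;
\end{cases} \tag{187} \]

and

\[ \kappa_2(t) = \begin{cases}
\dfrac{|t - 1|}{t + 1} & t \in [0, \infty); \\
1 & t = \infty.
\end{cases} \tag{188} \]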

Proof: Let $f_{\mathrm{TV}}, f_\Delta, f_{\mathrm{JS}}$ denote the functions $f\colon (0, \infty) \to \mathbb{R}$ in (55), (60) and (65), respectively; these functions yield $|P - Q|$, $\Delta(P\|Q)$ and $JS(P\|Q)$ as $f$-divergences. The functions $\kappa_1$ and $\kappa_2$, as introduced in (187) and (188), respectively, are the continuous extensions to $[0, \infty]$ of

\begin{align}
\kappa_1(t) &= \frac{f_{\mathrm{JS}}(t) + (t - 1) \log 2}{f_\Delta(t)}, \tag{189} \\
\kappa_2(t) &= \frac{f_\Delta(t)}{f_{\mathrm{TV}}(t)}. \tag{190}
\end{align}

It can be verified by (187) and (188) that $\kappa_i(t) = \kappa_i(\tfrac1t)$ for $t \in [0, \infty]$ and $i \in \{1, 2\}$; furthermore, $\kappa_1$ and $\kappa_2$ are both monotonically decreasing on $[0,1]$, and monotonically increasing on $[1,\infty]$. Consequently, if $\beta \in [\beta_2, \beta_1^{-1}]$ and $i \in \{1,2\}$,

\[ \kappa_i(1) \le \kappa_i(\beta) \le \max\bigl\{\kappa_i(\beta_1^{-1}),\, \kappa_i(\beta_2)\bigr\}. \tag{191} \]

In view of Theorem 6 and Remark 1, (185) follows from (189) and (191); (186) follows from (190) and (191).

If $\beta_1 = 0$, referring to an unbounded relative information $\imath_{P\|Q}$, the right inequalities in (183), (184) and (185), (186) respectively coincide (due to (187), (188), and since $\kappa_2(0) = 1$); otherwise, the right inequalities in (185), (186) provide, respectively, sharper upper bounds than those in (183), (184).

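Example 7: Suppose that $P \ll Q$, $P \ne Q$ and $\tfrac12 \le \frac{\mathrm{d}P}{\mathrm{d}Q}(a) \le 2$ for all $a \in \mathcal{A}$ (e.g., $P = (\tfrac23, \tfrac13)$ and $Q = (\tfrac13, \tfrac23)$); substituting $\beta_1 = \beta_2 = \tfrac12$ in (185)-(188) gives

\begin{align}
\tfrac12 \log e \le \frac{JS(P\|Q)}{\Delta(P\|Q)} &\le 0.510 \log e, \tag{192} \\
\frac{\Delta(P\|Q)}{|P - Q|} &\le \tfrac13 \tag{193}
\end{align}

improving the upper bounds on $\frac{JS(P\|Q)}{\Delta(P\|Q)}$ and $\frac{\Delta(P\|Q)}{|P - Q|}$ which are $\log 2 \approx 0.693 \log e$ and 1, respectively, according to the right inequalities in (183), (184).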
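The constants 0.510 and 1/3 in this example follow from evaluating (187) and (188) at $t = 2$. The sketch below works in nats and, as an assumption inferred from (187), takes $JS(P\|Q) = D(P\|M) + D(Q\|M)$ with $M = \tfrac12(P+Q)$ and $\Delta(P\|Q) = \sum_a (P(a)-Q(a))^2/(P(a)+Q(a))$; the binary pair used is one illustrative choice with likelihood ratios exactly 2 and 1/2.

```python
import math

def kappa1(t):
    """Function (187) on (0,1) U (1,inf), natural logarithms."""
    bracket = t * math.log(t) - (t + 1.0) * math.log((t + 1.0) / 2.0)
    return (t + 1.0) / (t - 1.0) ** 2 * bracket

def kappa2(t):
    """Function (188) on [0, inf)."""
    return abs(t - 1.0) / (t + 1.0)

def kl(u, v):
    """Relative entropy in nats for strictly positive pmfs."""
    return sum(a * math.log(a / b) for a, b in zip(u, v))

p, q = (2 / 3, 1 / 3), (1 / 3, 2 / 3)
m = [(a + b) / 2 for a, b in zip(p, q)]
js = kl(p, m) + kl(q, m)                                   # assumed JS convention
delta = sum((a - b) ** 2 / (a + b) for a, b in zip(p, q))  # triangular discrimination
tv = sum(abs(a - b) for a, b in zip(p, q))                 # |P - Q|

print(kappa1(2.0), js / delta, kappa2(2.0), delta / tv)
```

For this particular pair both upper bounds (192) and (193) are met with equality, since every atom has likelihood ratio 2 or 1/2.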

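For finite alphabets, [71, Theorem 7] shows

\[ \log 2 \le \frac{JS(P\|Q)}{\mathscr{H}_{1/2}(P\|Q)} \le \log e. \tag{194} \]

The following theorem extends the validity of (194) for Hellinger divergences of an arbitrary order $\alpha \in (0, \infty)$ and for a general alphabet, while also providing a refinement of both upper and lower bounds in (194) when the relative information $\imath_{P\|Q}$ is bounded.

Theorem 11: Let $P \ll Q$, $P \ne Q$, $\alpha \in (0,1) \cup (1, \infty)$, and let $\beta_1, \beta_2 \in [0,1]$ be given in (149)-(150). Let $\kappa_\alpha\colon [0, \infty] \to [0, \infty)$ be given by

\[ \kappa_\alpha(t) = \begin{cases}
\log 2 & t = 0; \\
\dfrac{(\alpha - 1) \left[t \log t - (t + 1) \log\!\left(\frac{t + 1}{2}\right)\right]}{t^\alpha + \alpha - 1 - \alpha t} & t \in (0,1) \cup (1, \infty); \\
\dfrac{\log e}{2 \alpha} & t = 1; \\
\left(\tfrac1\alpha - 1\right)^+ \log 2 & t = \infty.
\end{cases} \tag{195} \]

Then, if $\alpha \in (0,1)$,

\[ \min\bigl\{\kappa_\alpha(\beta_1^{-1}),\, \kappa_\alpha(\beta_2)\bigr\} \le \frac{JS(P\|Q)}{\mathscr{H}_\alpha(P\|Q)} \le \max_{\beta \in [\beta_2, \beta_1^{-1}]} \kappa_\alpha(\beta) \tag{196} \]

and, if $\alpha \in (1, \infty)$,

\[ \kappa_\alpha(\beta_1^{-1}) \le \frac{JS(P\|Q)}{\mathscr{H}_\alpha(P\|Q)} \le \kappa_\alpha(\beta_2). \tag{197} \]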
Proof: Let $f_\alpha$ and $f_{\mathrm{JS}}$ denote, respectively, the functions $f\colon (0, \infty) \to \mathbb{R}$ in (53) and (65) which yield the Hellinger divergence, $\mathscr{H}_\alpha(P\|Q)$, and the Jensen-Shannon divergence, $JS(P\|Q)$, as $f$-divergences. From (195), $\kappa_\alpha$ is the continuous extension to $[0, \infty]$ of

\[ \kappa_\alpha(t) = \frac{f_{\mathrm{JS}}(t) + (t - 1) \log 2}{f_\alpha(t) + \frac{\alpha}{1 - \alpha}(t - 1)}. \tag{198} \]

As shown in the proof of Theorem 9, for every $\alpha \in (0,1) \cup (1, \infty)$ and $t \in (0,1) \cup (1, \infty)$, we have $g_\alpha(t) = f_\alpha(t) + \frac{\alpha}{1 - \alpha}(t - 1) > 0$. It can be verified that $\kappa_\alpha\colon [0, \infty] \to \mathbb{R}$ has the following monotonicity properties:

• if $\alpha \in (0,1)$, there exists $t_\alpha > 0$ such that $\kappa_\alpha$ is monotonically increasing on $[0, t_\alpha]$, and it is monotonically decreasing on $[t_\alpha, \infty]$;

• if $\alpha \in (1,\infty)$, $\kappa_\alpha$ is monotonically decreasing on $[0, \infty]$.

Based on these properties of $\kappa_\alpha$, for $\beta \in [\beta_2, \beta_1^{-1}]$:

• if $\alpha \in (0,1)$,

\[ \min\bigl\{\kappa_\alpha(\beta_1^{-1}),\, \kappa_\alpha(\beta_2)\bigr\} \le \kappa_\alpha(\beta) \le \max_{t \in [\beta_2, \beta_1^{-1}]} \kappa_\alpha(t); \tag{199} \]

• if $\alpha \in (1, \infty)$,

\[ \kappa_\alpha(\beta_1^{-1}) \le \kappa_\alpha(\beta) \le \kappa_\alpha(\beta_2). \tag{200} \]

In view of Theorem 6, Remark 1 and (198), the bounds in (196) and (197) follow respectively from (199) and (200). ■

Example 8: Specializing Theorem 11 to $\alpha = \tfrac12$, since $\kappa_{1/2}(t) = \kappa_{1/2}(\tfrac1t)$ for $t \in (0, \infty)$ and $\kappa_{1/2}$ achieves its global maximum at $t = 1$ with $\kappa_{1/2}(1) = \log e$, (196) implies that

\[ \log 2 \le \kappa_{1/2}\bigl(\min\{\beta_1, \beta_2\}\bigr) \le \frac{JS(P\|Q)}{\mathscr{H}_{1/2}(P\|Q)} \le \log e. \tag{201} \]

Under the assumptions in Example 7, it follows from (201) that

\[ 0.990 \log e \le \frac{JS(P\|Q)}{\mathscr{H}_{1/2}(P\|Q)} \le \log e \tag{202} \]

which improves the lower bound $\log 2 \approx 0.693 \log e$ in the left side of (194).
Specializing Theorem 11 to $\alpha = 2$ implies that (cf. (195) and (197))

\[ \kappa_2(\beta_1^{-1}) \le \frac{JS(P\|Q)}{\chi^2(P\|Q)} \le \kappa_2(\beta_2). \tag{203} \]

Under the assumptions in Example 7, it follows from (203) that

\[ 0.170 \log e \le \frac{JS(P\|Q)}{\chi^2(P\|Q)} \le 0.340 \log e \tag{204} \]

while without any boundedness assumption, (197) yields the weaker upper and lower bounds

\[ 0 \le \frac{JS(P\|Q)}{\chi^2(P\|Q)} \le \log 2 \approx 0.693 \log e. \tag{205} \]
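The decimal constants in (202) and (204) come from evaluating (195) at $t = 2$ and $t = \tfrac12$. A minimal sketch in nats, assuming the form of $\kappa_\alpha$ on $(0,1) \cup (1,\infty)$ given in (195):

```python
import math

def h(t):
    """Numerator function t ln t - (t + 1) ln((t + 1) / 2), cf. (195)."""
    return t * math.log(t) - (t + 1.0) * math.log((t + 1.0) / 2.0)

def kappa(alpha, t):
    """Function (195) for t in (0,1) U (1,inf), alpha != 1."""
    return (alpha - 1.0) * h(t) / (t ** alpha + alpha - 1.0 - alpha * t)

k_half = kappa(0.5, 0.5)       # lower bound in (202)
k2_low = kappa(2.0, 2.0)       # lower bound in (204), beta_1^{-1} = 2
k2_high = kappa(2.0, 0.5)      # upper bound in (204), beta_2 = 1/2
print(k_half, k2_low, k2_high)
```

The three printed values should round to 0.990, 0.170 and 0.340.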


G. Local Behavior of f-Divergences 

Another application of Theorem 6 shows that the local behavior of $f$-divergences differs by only a constant, provided that the first distribution approaches the reference measure in a certain strong sense.

Theorem 12: Suppose that $\{P_n\}$, a sequence of probability measures defined on a measurable space $(\mathcal{A}, \mathscr{F})$, converges to $Q$ (another probability measure on the same space) in the sense that, for $Y \sim Q$,

\[ \lim_{n \to \infty} \operatorname{ess\,sup} \frac{\mathrm{d}P_n}{\mathrm{d}Q}(Y) = 1 \tag{206} \]

where it is assumed that $P_n \ll Q$ for all sufficiently large $n$. If $f$ and $g$ are convex on $(0, \infty)$ and they are positive except at $t = 1$ (where they are 0), then

\[ \lim_{n \to \infty} D_f(P_n\|Q) = \lim_{n \to \infty} D_g(P_n\|Q) = 0, \tag{207} \]

and

\[ \min\{\kappa(1^-), \kappa(1^+)\} \le \lim_{n \to \infty} \frac{D_f(P_n\|Q)}{D_g(P_n\|Q)} \le \max\{\kappa(1^-), \kappa(1^+)\} \tag{208} \]

where we have indicated the left and right limits of the function $\kappa(\cdot)$, defined in (85), at 1 by $\kappa(1^-)$ and $\kappa(1^+)$, respectively.


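Proof: Since $f(1) = 0$,

\begin{align}
0 \le D_f(P_n\|Q) &= \int f\!\left(\frac{\mathrm{d}P_n}{\mathrm{d}Q}\right) \mathrm{d}Q \tag{209} \\
&\le \sup_{\beta \in [\beta_{2,n},\, \beta_{1,n}^{-1}]} f(\beta) \tag{210}
\end{align}

where we have abbreviated

\begin{align}
\beta_{1,n}^{-1} &= \operatorname{ess\,sup} \frac{\mathrm{d}P_n}{\mathrm{d}Q}(Y), \tag{211} \\
\beta_{2,n} &= \operatorname{ess\,inf} \frac{\mathrm{d}P_n}{\mathrm{d}Q}(Y). \tag{212}
\end{align}

The condition in (206) yields

\begin{align}
\lim_{n \to \infty} \beta_{1,n} &= 1, \tag{213} \\
\lim_{n \to \infty} \beta_{2,n} &= 1 \tag{214}
\end{align}

where (213) is a restatement of (206) (see the notation in (211)), and Appendix C justifies (214). Hence, (207) follows from (210), (213), (214), and the continuity of $f$ at 1 (due to its convexity).

Abbreviating $I_n = [\beta_{2,n}, 1) \cup (1, \beta_{1,n}^{-1}]$, (162) and (168) result in

\[ \inf_{\beta \in I_n} \kappa(\beta)\, D_g(P_n\|Q) \le D_f(P_n\|Q) \le \sup_{\beta \in I_n} \kappa(\beta)\, D_g(P_n\|Q). \tag{215} \]

The right and left continuity of $\kappa(\cdot)$ at 1 together with (213) and (214) imply that

\begin{align}
\inf_{\beta \in I_n} \kappa(\beta) &\to \min\{\kappa(1^-), \kappa(1^+)\}, \tag{216} \\
\sup_{\beta \in I_n} \kappa(\beta) &\to \max\{\kappa(1^-), \kappa(1^+)\} \tag{217}
\end{align}

by letting $n \to \infty$.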


Corollary 2: Let $\{P_n \ll Q\}$ converge to $Q$ in the sense of (206). Then, $D(P_n\|Q)$ and $D(Q\|P_n)$ vanish as $n \to \infty$ with

\[ \lim_{n \to \infty} \frac{D(P_n\|Q)}{D(Q\|P_n)} = 1. \tag{218} \]

Corollary 3: Let $\{P_n \ll Q\}$ converge to $Q$ in the sense of (206). Then, $\chi^2(P_n\|Q)$ and $D(P_n\|Q)$ vanish as $n \to \infty$ with

\[ \lim_{n \to \infty} \frac{D(P_n\|Q)}{\chi^2(P_n\|Q)} = \tfrac12 \log e. \tag{219} \]

Note that (219) is known in the finite alphabet case [28, Theorem 4.1].
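The limits (218) and (219) can be illustrated on a binary alphabet. The sketch below is only an illustration in nats: it takes the hypothetical sequence $P_n = (\tfrac12 + \tfrac1n, \tfrac12 - \tfrac1n)$ and $Q = (\tfrac12, \tfrac12)$, for which the essential supremum of $\mathrm{d}P_n/\mathrm{d}Q$ is $1 + 2/n \to 1$, so (206) holds.

```python
import math

def kl(p, q):
    """Relative entropy in nats for strictly positive pmfs."""
    return sum(a * math.log(a / b) for a, b in zip(p, q))

def chi2(p, q):
    """Chi-squared divergence: sum (p - q)^2 / q."""
    return sum((a - b) ** 2 / b for a, b in zip(p, q))

def local_ratios(n):
    """Return (D(P_n||Q)/chi2(P_n||Q), D(P_n||Q)/D(Q||P_n)) for the binary sequence."""
    d = 1.0 / n
    p, q = (0.5 + d, 0.5 - d), (0.5, 0.5)
    return kl(p, q) / chi2(p, q), kl(p, q) / kl(q, p)

for n in (10, 100, 1000):
    print(n, local_ratios(n))
```

As `n` grows, the first ratio approaches 1/2 (nats) and the second approaches 1.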

In Example [TJ the ratio in ( |208[ ) is equal to while the lower and upper bounds are | and 1, respectively. 
Continuing with Examples 3, 4 and 5, it is easy to check that (206) is satisfied in the following cases.
Example 9: A sequence of Laplacian probability density functions with common variance and converging means:

\begin{align}
p_n(x) &= \frac{\lambda}{2} \exp(-\lambda |x - a_n|), \quad x \in \mathbb{R} \tag{220} \\
\lim_{n \to \infty} a_n &= a \in \mathbb{R}. \tag{221}
\end{align}

Example 10: A sequence of converging Cramér probability density functions:

\begin{align}
p_n(x) &= \frac{\theta_n}{2 (1 + \theta_n |x - m_n|)^2}, \quad x \in \mathbb{R} \tag{222} \\
\lim_{n \to \infty} m_n &= m \in \mathbb{R}, \tag{223} \\
\lim_{n \to \infty} \theta_n &= \theta > 0. \tag{224}
\end{align}
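Example 11: A sequence of converging Cauchy probability density functions:

\begin{align}
p_n(x) &= \frac{1}{\pi \gamma_n} \left[1 + \left(\frac{x - m_n}{\gamma_n}\right)^2\right]^{-1}, \quad x \in \mathbb{R} \tag{225} \\
\lim_{n \to \infty} m_n &= m \in \mathbb{R}, \tag{226} \\
\lim_{n \to \infty} \gamma_n &= \gamma > 0. \tag{227}
\end{align}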


H. Strengthened Jensen's Inequality

Bounding away from zero a certain density between two probability measures enables the following strengthened version of Jensen's inequality, which generalizes a result in [33, Theorem 1].

Lemma 1: Let $f\colon \mathbb{R} \to \mathbb{R}$ be a convex function, $P_1 \ll P_0$ be probability measures defined on a measurable space $(\mathcal{A}, \mathscr{F})$, and fix an arbitrary random transformation $P_{Z|X}\colon \mathcal{A} \to \mathbb{R}$. Denote^{10} $P_0 \to P_{Z|X} \to P_{Z_0}$, and $P_1 \to P_{Z|X} \to P_{Z_1}$. Then,

\[ \beta \bigl(\mathbb{E}[f(\mathbb{E}[Z_0|X_0])] - f(\mathbb{E}[Z_0])\bigr) \le \mathbb{E}[f(\mathbb{E}[Z_1|X_1])] - f(\mathbb{E}[Z_1]) \tag{228} \]

where $X_0 \sim P_0$, $X_1 \sim P_1$, and

\[ \beta = \operatorname{ess\,inf} \frac{\mathrm{d}P_1}{\mathrm{d}P_0}(X_0). \tag{229} \]

Proof: If $\beta = 0$, the claimed result is Jensen's inequality, while if $\beta = 1$, $P_0 = P_1$ and the result is trivial. Hence, we assume $\beta \in (0,1)$. Note that $P_1 = \beta P_0 + (1 - \beta) P_2$ where $P_2$ is the probability measure whose density with respect to $P_0$ is given by

\[ \frac{\mathrm{d}P_2}{\mathrm{d}P_0} = \frac{1}{1 - \beta} \left(\frac{\mathrm{d}P_1}{\mathrm{d}P_0} - \beta\right) \ge 0. \tag{230} \]

Letting $P_2 \to P_{Z|X} \to P_{Z_2}$, Jensen's inequality implies

\[ f(\mathbb{E}[Z_1]) \le \beta f(\mathbb{E}[Z_0]) + (1 - \beta) f(\mathbb{E}[Z_2]). \tag{231} \]

Furthermore, we can apply Jensen's inequality again to obtain

\begin{align}
f(\mathbb{E}[Z_2]) &= f(\mathbb{E}[\mathbb{E}[Z_2|X_2]]) \tag{232} \\
&\le \mathbb{E}[f(\mathbb{E}[Z_2|X_2])] \tag{233} \\
&= \frac{\mathbb{E}[f(\mathbb{E}[Z_1|X_1])] - \beta\, \mathbb{E}[f(\mathbb{E}[Z_0|X_0])]}{1 - \beta}. \tag{234}
\end{align}

Substituting this bound on $f(\mathbb{E}[Z_2])$ in (231) we obtain the desired result. ■

^{10} We follow the notation in [110] where $P_0 \to P_{Z|X} \to P_{Z_0}$ means that the marginal probability measures of the joint distribution $P_0 P_{Z|X}$ are $P_0$ and $P_{Z_0}$.

Remark 14: Letting $Z = X$, and choosing $P_0$ so that $\beta = 0$ (e.g., $P_1$ is a restriction of $P_0$ to an event of $P_0$-probability less than 1), (228) becomes Jensen's inequality $f(\mathbb{E}[X_1]) \le \mathbb{E}[f(X_1)]$.

Lemma 1 finds the following application to the derivation of $f$-divergence inequalities.

Theorem 13: Let $f\colon (0, \infty) \to \mathbb{R}$ be a convex function with $f(1) = 0$. Fix $P \ll\gg Q$ on the same space with $(\beta_1, \beta_2) \in [0,1)^2$ and let $X \sim P$. Then,

\begin{align}
\beta_2\, D_f(P\|Q) &\le \mathbb{E}\bigl[f\bigl(\exp(\imath_{P\|Q}(X))\bigr)\bigr] - f\bigl(1 + \chi^2(P\|Q)\bigr) \tag{235} \\
&\le \beta_1^{-1} D_f(P\|Q). \tag{236}
\end{align}

Proof: We invoke Lemma 1 with $P_{Z|X}$ that is given by the deterministic transformation $\exp(\imath_{P\|Q}(\cdot))\colon \mathcal{A} \to \mathbb{R}$. Then, $\mathbb{E}[Z_0|X_0] = \exp(\imath_{P\|Q}(X_0))$. If, moreover, we let $X_0 \sim Q = P_0$, we obtain

\begin{align}
\mathbb{E}[Z_0] &= 1, \tag{237} \\
\mathbb{E}[f(\mathbb{E}[Z_0|X_0])] &= D_f(P\|Q) \tag{238}
\end{align}

and if we let $X_1 \sim P = P_1$, we have (see (50))

\begin{align}
\mathbb{E}[Z_1] &= 1 + \chi^2(P\|Q), \tag{239} \\
\mathbb{E}[f(\mathbb{E}[Z_1|X_1])] &= \mathbb{E}\bigl[f\bigl(\exp(\imath_{P\|Q}(X))\bigr)\bigr]. \tag{240}
\end{align}

Therefore, (235) follows from Lemma 1. Recalling (151), inequality (236) follows from Lemma 1 as well by switching the roles of $P_0$ and $P_1$, namely, now we take $P = P_0$ and $Q = P_1$. ■

Specializing Theorem 13 to the convex function on $(0, \infty)$ where $f(t) = -\log t$ sharpens inequality (94) under the assumption of bounded relative information.

Theorem 14: Fix $P \ll\gg Q$ such that $(\beta_1, \beta_2) \in (0,1)^2$. Then,

\begin{align}
\beta_2\, D(Q\|P) &\le \log\bigl(1 + \chi^2(P\|Q)\bigr) - D(P\|Q) \tag{241} \\
&\le \beta_1^{-1} D(Q\|P). \tag{242}
\end{align}
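A randomized check of (241)-(242) on finite alphabets may help when reading the statement. It is a sketch in nats under the assumption of strictly positive pmfs (so that $P \ll\gg Q$); `random_pmf` and `check_theorem14` are hypothetical helper names.

```python
import math
import random

def kl(p, q):
    """Relative entropy in nats."""
    return sum(a * math.log(a / b) for a, b in zip(p, q))

def chi2(p, q):
    """Chi-squared divergence."""
    return sum((a - b) ** 2 / b for a, b in zip(p, q))

def random_pmf(n, rng):
    """Hypothetical helper: normalize n weights drawn from [0.05, 1]."""
    w = [rng.uniform(0.05, 1.0) for _ in range(n)]
    total = sum(w)
    return [x / total for x in w]

def check_theorem14(trials=2000, seed=0):
    """Return True if (241)-(242) held (up to rounding) on every random pair."""
    rng = random.Random(seed)
    for _ in range(trials):
        p, q = random_pmf(5, rng), random_pmf(5, rng)
        ratios = [a / b for a, b in zip(p, q)]
        beta1, beta2 = 1.0 / max(ratios), min(ratios)
        middle = math.log(1.0 + chi2(p, q)) - kl(p, q)
        rev = kl(q, p)
        if not (beta2 * rev - 1e-12 <= middle <= rev / beta1 + 1e-12):
            return False
    return True

print(check_theorem14())
```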
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V. Total Variation Distance, Relative Information Spectrum and Relative Entropy 


A. Exact Expressions 

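The following result provides several useful expressions of the total variation distance in terms of the relative information.

Theorem 15: Let $P \ll Q$, and let $X \sim P$ and $Y \sim Q$ be defined on a measurable space $(\mathcal{A}, \mathscr{F})$. Then^{11}

\begin{align}
|P - Q| &= \mathbb{E}\bigl[\bigl|1 - \exp(\imath_{P\|Q}(Y))\bigr|\bigr] \tag{243} \\
&= 2\, \mathbb{E}\bigl[\bigl(1 - \exp(\imath_{P\|Q}(Y))\bigr)^+\bigr] \tag{244} \\
&= 2\, \mathbb{E}\bigl[\bigl(1 - \exp(\imath_{P\|Q}(Y))\bigr)^-\bigr] \tag{245} \\
&= 2\, \mathbb{E}\bigl[\bigl(1 - \exp(-\imath_{P\|Q}(X))\bigr)^+\bigr] \tag{246} \\
&= 2 \bigl(\mathbb{P}[\imath_{P\|Q}(X) > 0] - \mathbb{P}[\imath_{P\|Q}(Y) > 0]\bigr) \tag{247} \\
&= 2 \bigl(\mathbb{P}[\imath_{P\|Q}(Y) \le 0] - \mathbb{P}[\imath_{P\|Q}(X) \le 0]\bigr) \tag{248} \\
&= 2 \int_0^1 \mathbb{P}[\imath_{P\|Q}(Y) < \log \beta]\, \mathrm{d}\beta \tag{249} \\
&= 2 \int_0^1 \mathbb{P}\Bigl[\imath_{P\|Q}(X) > \log \tfrac1\beta\Bigr]\, \mathrm{d}\beta \tag{250} \\
&= 2 \int_1^{\beta_1^{-1}} \beta^{-2} \bigl[1 - \mathbb{F}_{P\|Q}(\log \beta)\bigr]\, \mathrm{d}\beta. \tag{251}
\end{align}

Furthermore, if $P \ll\gg Q$, then

\begin{align}
|P - Q| &= 2\, \mathbb{E}\bigl[\bigl(1 - \exp(-\imath_{P\|Q}(X))\bigr)^-\bigr] \tag{252} \\
&= \mathbb{E}\bigl[\bigl|1 - \exp(-\imath_{P\|Q}(X))\bigr|\bigr]. \tag{253}
\end{align}

Proof: See Appendix D.

Remark 15: In view of (247), if $P \ll Q$, the supremum in (58) is a maximum which is achieved by the event

\[ \mathcal{F}^* = \{a \in \mathcal{A}\colon \imath_{P\|Q}(a) > 0\} \in \mathscr{F}. \tag{254} \]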
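For a finite alphabet several of these expressions can be compared directly with $\sum_a |P(a) - Q(a)|$. The sketch below assumes strictly positive pmfs, so that $\exp(\imath_{P\|Q}(a)) = P(a)/Q(a)$ is well defined everywhere; the particular distributions are arbitrary illustrations.

```python
p = [0.1, 0.2, 0.3, 0.4]
q = [0.25, 0.25, 0.25, 0.25]
ratio = [a / b for a, b in zip(p, q)]          # exp(i_{P||Q}(a))

tv = sum(abs(a - b) for a, b in zip(p, q))     # |P - Q|

# (243): expectation under Q of |1 - exp(i(Y))|
e243 = sum(b * abs(1.0 - t) for b, t in zip(q, ratio))
# (244): 2 E_Q[(1 - exp(i(Y)))^+]
e244 = 2.0 * sum(b * max(1.0 - t, 0.0) for b, t in zip(q, ratio))
# (246): 2 E_P[(1 - exp(-i(X)))^+]
e246 = 2.0 * sum(a * max(1.0 - 1.0 / t, 0.0) for a, t in zip(p, ratio))
# (247): 2 (P[i > 0] - Q[i > 0]), evaluated on the event (254)
event = [t > 1.0 for t in ratio]
e247 = 2.0 * (sum(a for a, keep in zip(p, event) if keep)
              - sum(b for b, keep in zip(q, event) if keep))

print(tv, e243, e244, e246, e247)
```

All five numbers agree (0.4 for this pair), which also illustrates Remark 15: the event in (254) attains the maximum.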


Similarly to Theorem 15, the following theorem provides several expressions of the relative entropy in terms of the relative information spectrum.

^{11} $(z)^+ = z\, 1\{z > 0\} = \max\{z, 0\}$, and $(z)^- = -z\, 1\{z < 0\} = \max\{-z, 0\}$.
Theorem 16: If $D(P\|Q) < \infty$, then

\begin{align}
D(P\|Q) &= \int_0^\infty \bigl(1 - \mathbb{F}_{P\|Q}(\alpha)\bigr)\, \mathrm{d}\alpha - \int_{-\infty}^0 \mathbb{F}_{P\|Q}(\alpha)\, \mathrm{d}\alpha \tag{255} \\
&= \int_1^\infty \frac{1 - \mathbb{F}_{P\|Q}(\log \beta)}{\beta}\, \mathrm{d}\beta - \int_0^1 \frac{\mathbb{F}_{P\|Q}(\log \beta)}{\beta}\, \mathrm{d}\beta \tag{256} \\
&= \int_0^\infty \mathbb{P}[\imath_{P\|Q}(Y) > \alpha]\, \alpha e^\alpha\, \mathrm{d}\alpha - \int_{-\infty}^0 \mathbb{P}[\imath_{P\|Q}(Y) < \alpha]\, \alpha e^\alpha\, \mathrm{d}\alpha \tag{257}
\end{align}

where $Y \sim Q$, and for convenience (256) and (257) assume that the relative information and the resulting relative entropy are in nats.

Proof: The expectation of a real-valued random variable $V$ is equal to

\[ \mathbb{E}[V] = \int_0^\infty \mathbb{P}[V > t]\, \mathrm{d}t - \int_{-\infty}^0 \mathbb{P}[V < t]\, \mathrm{d}t \tag{258} \]

where we are free to substitute $>$ by $\ge$, and $<$ by $\le$. If we let $V = \imath_{P\|Q}(X)$ with $X \sim P$, then (258) yields (255) provided that $D(P\|Q) = \mathbb{E}[V] < \infty$.

Eq. (256) follows from (255) by the substitution $\alpha = \log \beta$ when the relative entropy is expressed in nats.
To prove (257), let $Z = \exp(\imath_{P\|Q}(Y))$ with $Y \sim Q$, and let $V = r(Z)$ where $r\colon (0, \infty) \to [0, \infty)$ is given in (43) with natural logarithm. The function $r$ is strictly monotonically increasing on $[1,\infty)$, on which interval we define its inverse by $s_1\colon [0, \infty) \to [1,\infty)$; it is also strictly monotonically decreasing on $(0,1]$, on which interval we define its inverse by $s_2\colon [0,1) \to (0,1]$. Then, only the first integral on the right side of (258) can be non-zero, and we decompose it as

\begin{align}
D(P\|Q) &= \int_0^\infty \mathbb{P}[Z > 1,\, r(Z) > t]\, \mathrm{d}t + \int_0^1 \mathbb{P}[Z < 1,\, r(Z) > t]\, \mathrm{d}t \notag \\
&= \int_0^\infty \mathbb{P}[Z > s_1(t)]\, \mathrm{d}t + \int_0^1 \mathbb{P}[Z < s_2(t)]\, \mathrm{d}t \tag{259} \\
&= \int_1^\infty \mathbb{P}[Z > v] \log_e v\, \mathrm{d}v - \int_0^1 \mathbb{P}[Z < v] \log_e v\, \mathrm{d}v \tag{260}
\end{align}

where (260) follows from the change of variable of integration $t = r(v)$. Upon taking $\log_e$ on both sides of the inequalities inside the probabilities in (260), and a further change of the variable of integration $v = e^\alpha$, (260) is seen to be equal to (257). ■


B. Upper Bounds on \P — Q\ 

In this section, we provide three upper bounds on $|P - Q|$ which complement (1).

Theorem 17: If $P \ll Q$ and $X \sim P$, then

\[ |P - Q| \log e \le D(P\|Q) + \mathbb{E}\bigl[|\imath_{P\|Q}(X)|\bigr]. \tag{261} \]

Proof: For every $z \in [-\infty, \infty]$,

\[ \bigl(1 - \exp(-z)\bigr)\, 1\{z > 0\} \log e \le z\, 1\{z > 0\}. \tag{262} \]

Substituting $z = \imath_{P\|Q}(X)$, taking expectation of both sides of (262), and using (246) give

\begin{align}
|P - Q| \log e &\le 2\, \mathbb{E}\bigl[\imath_{P\|Q}(X)\, 1\{\imath_{P\|Q}(X) > 0\}\bigr] \tag{263} \\
&= \mathbb{E}\bigl[\imath_{P\|Q}(X) + |\imath_{P\|Q}(X)|\bigr] \tag{264} \\
&= D(P\|Q) + \mathbb{E}\bigl[|\imath_{P\|Q}(X)|\bigr]. \tag{265}
\end{align}

Remark 16: Theorem 17 is tighter than Pinsker's bound in [78, (2.3.14)]:

\[ |P - Q| \log e \le 2\, \mathbb{E}\bigl[|\imath_{P\|Q}(X)|\bigr]. \tag{266} \]
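The chain $|P - Q| \le D(P\|Q) + \mathbb{E}[|\imath_{P\|Q}(X)|] \le 2\, \mathbb{E}[|\imath_{P\|Q}(X)|]$ (in nats) can be checked on random finite distributions. This is a sketch with strictly positive pmfs; `random_pmf` and `check_theorem17` are hypothetical helper names.

```python
import math
import random

def random_pmf(n, rng):
    """Hypothetical helper: normalize n weights drawn from [0.05, 1]."""
    w = [rng.uniform(0.05, 1.0) for _ in range(n)]
    total = sum(w)
    return [x / total for x in w]

def bounds(p, q):
    """Return (|P - Q|, D + E|i(X)|, 2 E|i(X)|) in nats, with X ~ P."""
    info = [math.log(a / b) for a, b in zip(p, q)]        # relative information
    tv = sum(abs(a - b) for a, b in zip(p, q))
    d = sum(a * i for a, i in zip(p, info))               # D(P||Q) = E[i(X)]
    abs_mean = sum(a * abs(i) for a, i in zip(p, info))   # E|i(X)|
    return tv, d + abs_mean, 2.0 * abs_mean

def check_theorem17(trials=2000, seed=0):
    """Return True if (261) and the comparison with (266) held on every pair."""
    rng = random.Random(seed)
    for _ in range(trials):
        tv, thm17, pinsker = bounds(random_pmf(5, rng), random_pmf(5, rng))
        if not (tv <= thm17 + 1e-12 and thm17 <= pinsker + 1e-12):
            return False
    return True

print(check_theorem17())
```

The second inequality is just $D(P\|Q) = \mathbb{E}[\imath_{P\|Q}(X)] \le \mathbb{E}[|\imath_{P\|Q}(X)|]$, which is why Theorem 17 is never looser than (266).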


The second upper bound on \P — Q\ is a consequence of Theorem 15 
Theorem 18: Let with (/h. 82) G [0, l) 2 . Then, for every fj G [fi \, 1], 

\\p-Q\ <(i-Aj)p[*phqW >0] 

+ Wo - Pi) P *p||qP0 > log ^ 

where X ~ P, and, for every ff) G [lh, 1], 

i|P-Q|<(l-^o)P[z P || Q (F)<0] 


(266) 


(267) 


(268) 


+ (/9o - A 2 )P[*P||q(^) < log Po] 

where Y ~ Q. Furthermore, both upper bounds on \P — Q\ in ( 267[ ) and ( |268| ) are tight in the sense that they 
are achievable by suitable pairs of probability measures defined on a binary alphabet. 

Proof: Since the integrand in the right side of ( |250[ ) is monotonically increasing in j3, we may upper bound 
it by P i p \\q{X) > log H when f3 G [/3i,/?o], and by P[*p||qP0 > 0] when j3 G (Po, 1], The same reasoning 
applied to (249]) yields ([268 1. 
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To check the tightness of ( |267[ ) for any 0i E (0,1], choose an arbitrary rj E (0, f) and a pair of probability 
measures P and Q defined on the binary alphabet A = {(), 1} with 

1 — rj 


P( o) = 


(269) 

(270) 


1 - 7701 

Q(o) = PiP(o). 

Then, we have *p||q(0) = log^-, i p \\q{1) = log?y < 0, and both sides of ( 267 ) are readily seen to be equal 
when 0o = 01- The tightness of ( |268[ ) can be shown in a similar way. ■ 
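The tightness claim for (267) can be illustrated numerically. The following sketch builds the binary pair of (269) and (270) for hypothetical values of $\beta_1$ and $\eta$, and compares $\tfrac12|P-Q|$ with the right side of (267) at $\beta_0 = \beta_1$; the function names are illustrative only.

```python
import math

def binary_pair(beta1, eta):
    """P and Q on {0, 1} constructed as in (269)-(270)."""
    p0 = (1 - eta) / (1 - eta * beta1)
    q0 = beta1 * p0
    return (p0, 1 - p0), (q0, 1 - q0)

def rhs_267(p, q, beta0, beta1):
    """Right side of (267); the probabilities are taken under X ~ P."""
    ratios = [a / b for a, b in zip(p, q)]
    # P[i(X) > 0] is the P-mass of the letters whose likelihood ratio exceeds 1
    pr_positive = sum(a for a, r in zip(p, ratios) if r > 1)
    # P[i(X) > log(1/beta0)]
    pr_tail = sum(a for a, r in zip(p, ratios) if r > 1 / beta0)
    return (1 - beta0) * pr_positive + (beta0 - beta1) * pr_tail

beta1, eta = 0.4, 0.3  # hypothetical parameters
p, q = binary_pair(beta1, eta)
half_tv = 0.5 * sum(abs(a - b) for a, b in zip(p, q))
bound = rhs_267(p, q, beta0=beta1, beta1=beta1)
print(half_tv, bound)
```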

The third upper bound on $|P - Q|$ is a classical inequality [55, (99)], usually given in the context of bounding the error probability of Bayesian binary hypothesis testing in terms of the Bhattacharyya distance.

Theorem 19:

$$\tfrac14 |P - Q|^2 \le 1 - \exp\bigl(-D_{1/2}(P\|Q)\bigr). \tag{271}$$

Remark 17: The bound in (271) is tight if $P, Q$ are defined on $\{0,1\}$ with $P(0) = Q(0)$ or $P(0) = Q(1)$. In view of the monotonicity of $D_\alpha(P\|Q)$ in $\alpha$, Theorem 19 yields (4), which is equivalent to the Bretagnolle-Huber inequality [15, (2.2)] (see also [107, pp. 30-31]). Note that (4) is tighter than (1) only when $|P - Q| > 1.7853$.
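The threshold quoted in Remark 17 can be reproduced numerically. The sketch below assumes that (1) is Pinsker's inequality $D \ge \tfrac12 |P-Q|^2 \log e$ and that (4) reads $D \ge \log \frac{1}{1 - \frac14 |P-Q|^2}$; it locates by bisection the value of $|P-Q|$ at which the two lower bounds coincide. Everything is in nats.

```python
import math

def pinsker_lower(t):
    """Lower bound on D from (1), assumed to be t^2 / 2 nats for t = |P - Q|."""
    return 0.5 * t * t

def bretagnolle_huber_lower(t):
    """Lower bound on D from (4), assumed to be log(1 / (1 - t^2 / 4)) nats."""
    return -math.log(1.0 - 0.25 * t * t)

def crossover(lo=1.0, hi=1.99, iterations=60):
    """Bisection for the t in (lo, hi) where the two bounds are equal."""
    for _ in range(iterations):
        mid = 0.5 * (lo + hi)
        if bretagnolle_huber_lower(mid) > pinsker_lower(mid):
            hi = mid  # (4) already dominates: the crossing is to the left
        else:
            lo = mid
    return 0.5 * (lo + hi)

root = crossover()
print(root)
```

Below the crossing point Pinsker's bound is the larger of the two, which is why (4) only helps for pairs of measures that are nearly mutually singular.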


C. Lower Bounds on $|P - Q|$

In this section, we give several lower bounds on $|P - Q|$ in terms of the relative information spectrum. Furthermore, in Section VI we give lower bounds on $|P - Q|$ in terms of the relative entropy (as well as other features of $P$ and $Q$).

If, for at least one value of $\beta \in (0,1)$, either $\mathbb{P}\bigl[\imath_{P\|Q}(X) > \log \tfrac{1}{\beta}\bigr]$ or $\mathbb{P}\bigl[\imath_{P\|Q}(Y) < \log \beta\bigr]$ is known, then we get the following lower bounds on $|P - Q|$ as a consequence of Theorem 15.

Theorem 20: If $P \ll Q$ then, for every $\beta_0 \in (0,1)$,

$$|P - Q| \ge 2 (1 - \beta_0)\, \mathbb{P}\bigl[\imath_{P\|Q}(Y) < \log \beta_0\bigr], \tag{272}$$

$$|P - Q| \ge 2 (1 - \beta_0)\, \mathbb{P}\Bigl[\imath_{P\|Q}(X) > \log \tfrac{1}{\beta_0}\Bigr] \tag{273}$$

with $X \sim P$ and $Y \sim Q$.

Proof: The lower bounds in (272) and (273) follow from (249) and (250), respectively. For example, from (249), it follows that for an arbitrary $\beta_0 \in (0,1)$

$$\tfrac12 |P - Q| = \int_{\beta_2}^{1} \mathbb{P}\bigl[\imath_{P\|Q}(Y) < \log \beta\bigr]\, d\beta \tag{274}$$

$$\ge \int_{\beta_0}^{1} \mathbb{P}\bigl[\imath_{P\|Q}(Y) < \log \beta\bigr]\, d\beta \tag{275}$$

$$\ge (1 - \beta_0)\, \mathbb{P}\bigl[\imath_{P\|Q}(Y) < \log \beta_0\bigr] \tag{276}$$

where (276) holds since the integrand in (275) is monotonically increasing in $\beta \in (0,1]$. ■

Next we exemplify the utility of Theorem 20 by giving an alternative proof of the tight lower bound on the relative information spectrum, given in [68, Proposition 2] as a function of the total variation distance.

Proposition 2: Let $P \ll Q$. Then, for every $\beta > 0$,

$$\mathbb{F}_{P\|Q}(\log \beta) \ge \begin{cases} 0, & \beta \in \bigl(0, \tfrac{2}{2 - |P-Q|}\bigr] \\ 1 - \dfrac{\beta\, |P - Q|}{2(\beta - 1)}, & \beta \in \bigl(\tfrac{2}{2 - |P-Q|}, \infty\bigr). \end{cases} \tag{277}$$

Furthermore, for every $\beta > 0$ and $\delta \in [0,1)$, the lower bound in (277) is attainable by a pair $(P, Q)$ with $|P - Q| = 2\delta$.

Proof: Since $P \ll Q$, they cannot be mutually singular and therefore $|P - Q| < 2$. From (27) and (273) (see Theorem 20), it follows that for every $\beta_0 \in (0,1)$

$$\tfrac12 |P - Q| \ge (1 - \beta_0) \Bigl(1 - \mathbb{F}_{P\|Q}\bigl(\log \tfrac{1}{\beta_0}\bigr)\Bigr). \tag{278}$$

Consequently, the substitution $\beta = \tfrac{1}{\beta_0} > 1$ yields

$$\mathbb{F}_{P\|Q}(\log \beta) \ge 1 - \frac{\beta\, |P - Q|}{2 (\beta - 1)} \tag{279}$$

which provides a non-negative lower bound on the relative information spectrum provided that $\beta \ge \tfrac{2}{2 - |P - Q|}$.

Having shown (277), we proceed to argue that it is tight. Fix $\delta \in [0,1)$ and let $|P - Q| = 2\delta$, which yields $\tfrac{2}{2 - |P-Q|} = \tfrac{1}{1 - \delta}$ in the right side of (277).

• If $\beta < \tfrac{1}{1-\delta}$, let the pair $(P, Q)$ be defined on the binary alphabet $\{0,1\}$ with $P(1) = 1$ and $Q(1) = 1 - \delta$ (thereby ensuring $2\delta = |P - Q|$). Then, from (27),

$$\mathbb{F}_{P\|Q}(\log \beta) = P(0) = 0. \tag{280}$$

• If $\beta \ge \tfrac{1}{1-\delta}$, let $\tau > \beta$ and consider the probability measures $P = P_\tau$ and $Q = Q_\tau$ defined on the binary alphabet $\{0,1\}$ with $P_\tau(1) = \tfrac{\tau\delta}{\tau - 1}$ and $Q_\tau(1) = \tfrac{\delta}{\tau - 1}$ (note that indeed $2\delta = |P_\tau - Q_\tau|$). Since $1 < \beta < \tau$, then

$$\mathbb{F}_{P\|Q}(\log \beta) = P_\tau(0) = 1 - \frac{\tau \delta}{\tau - 1} \tag{281}$$

which tends to $1 - \tfrac{\beta\delta}{\beta - 1}$ in the right side of (277) by letting $\tau \downarrow \beta$. ■


Attained under certain conditions, the following counterpart to Theorem 18 gives a lower bound on the total variation distance based on the distribution of the relative information. It strengthens the bound in [109, Theorem 8], which in turn tightens the lower bounds in [78, (2.3.18)] and [98, Lemma 7].

Theorem 21: If $P \ll\gg Q$ then, for any $\eta_1, \eta_2 > 0$,

$$|P - Q| \ge \bigl(1 - \exp(-\eta_1)\bigr)\, \mathbb{P}\bigl[\imath_{P\|Q}(X) \ge \eta_1\bigr] + \bigl(\exp(\eta_2) - 1\bigr)\, \mathbb{P}\bigl[\imath_{P\|Q}(X) \le -\eta_2\bigr] \tag{282}$$

with $X \sim P$. Equality holds in (282) if $P$ and $Q$ are probability measures defined on $\{0,1\}$ and, for arbitrary $\eta_1, \eta_2 > 0$,

$$P(0) = \frac{1 - \exp(-\eta_2)}{1 - \exp(-\eta_1 - \eta_2)}, \tag{283}$$

$$Q(0) = \exp(-\eta_1)\, P(0). \tag{284}$$

Proof: From (253), it follows that for arbitrary $\eta_1, \eta_2 > 0$,

$$|P - Q| \ge \mathbb{E}\bigl[\,|1 - \exp(-\imath_{P\|Q}(X))|\; 1\{\imath_{P\|Q}(X) \ge \eta_1\}\bigr] + \mathbb{E}\bigl[\,|1 - \exp(-\imath_{P\|Q}(X))|\; 1\{\imath_{P\|Q}(X) \le -\eta_2\}\bigr] \tag{285}$$

which is readily loosened to obtain (282). Equality holds in (282) for $P$ and $Q$ in the theorem statement since $\imath_{P\|Q}(X)$ only takes the values $\log \tfrac{P(0)}{Q(0)} = \eta_1$ and $\log \tfrac{P(1)}{Q(1)} = -\eta_2$. ■
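The equality condition of Theorem 21 can be checked directly. The sketch below constructs the binary pair of (283) and (284) for hypothetical $\eta_1, \eta_2$ (relative information measured in nats) and evaluates both sides of (282); a small tolerance is used in the threshold comparisons because the relative information is computed in floating point.

```python
import math

def equality_pair(eta1, eta2):
    """P and Q on {0, 1} from (283)-(284)."""
    p0 = (1 - math.exp(-eta2)) / (1 - math.exp(-eta1 - eta2))
    q0 = math.exp(-eta1) * p0
    return (p0, 1 - p0), (q0, 1 - q0)

def rhs_282(p, q, eta1, eta2, tol=1e-12):
    """Right side of (282), with X ~ P and relative information in nats."""
    info = [math.log(a / b) for a, b in zip(p, q)]
    upper_mass = sum(a for a, i in zip(p, info) if i >= eta1 - tol)
    lower_mass = sum(a for a, i in zip(p, info) if i <= -eta2 + tol)
    return (1 - math.exp(-eta1)) * upper_mass + (math.exp(eta2) - 1) * lower_mass

eta1, eta2 = 0.7, 0.4  # hypothetical thresholds
p, q = equality_pair(eta1, eta2)
tv = sum(abs(a - b) for a, b in zip(p, q))
bound = rhs_282(p, q, eta1, eta2)
print(tv, bound)
```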

The following lower bound on the total variation distance is the counterpart to Theorem 17.

Theorem 22: If $P \ll\gg Q$ and $X \sim P$, then

$$|P - Q| \log e \ge \mathbb{E}\bigl[|\imath_{P\|Q}(X)|\bigr] - D(P\|Q). \tag{286}$$

Proof: We reason in parallel to the proof of Theorem 17. For all $z \in [-\infty, \infty]$,

$$\bigl[1 - \exp(-z)\bigr]^{-} \ge \frac{(z)^{-}}{\log e}. \tag{287}$$

Substituting $z = \imath_{P\|Q}(X)$, taking expectation of both sides of (287), and using (252) we obtain

$$|P - Q| \log e \ge 2\, \mathbb{E}\bigl[(\imath_{P\|Q}(X))^{-}\bigr] \tag{288}$$

$$= \mathbb{E}\bigl[|\imath_{P\|Q}(X)| - \imath_{P\|Q}(X)\bigr] \tag{289}$$

$$= \mathbb{E}\bigl[|\imath_{P\|Q}(X)|\bigr] - D(P\|Q). \tag{290}$$

Remark 18: The combination of Pinsker's inequality (1) and (286) yields the following inequality due to Barron (see EIp. 339]) which is useful in establishing convergence results for relative entropy (e.g. 0):

$$\mathbb{E}\bigl[|\imath_{P\|Q}(X)|\bigr] \le D(P\|Q) + \sqrt{2 D(P\|Q) \log e} \tag{291}$$

with $X \sim P$.
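Theorem 22 and the inequality (291) can be verified on the same kind of toy example. The sketch below uses hypothetical probability vectors and works in nats, so that $\log e = 1$.

```python
import math

p = [0.5, 0.3, 0.2]  # hypothetical P
q = [0.2, 0.3, 0.5]  # hypothetical Q

# Relative information i_{P||Q}(a) = log(P(a)/Q(a)) in nats, one value per letter
info = [math.log(a / b) for a, b in zip(p, q)]

tv = sum(abs(a - b) for a, b in zip(p, q))            # |P - Q|
d = sum(a * i for a, i in zip(p, info))               # D(P||Q)
mean_abs = sum(a * abs(i) for a, i in zip(p, info))   # E[|i(X)|], X ~ P

theorem22_gap = tv - (mean_abs - d)            # non-negative by (286)
barron_gap = d + math.sqrt(2 * d) - mean_abs   # non-negative by (291)
print(theorem22_gap, barron_gap)
```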


D. Relative Entropy and Bhattacharyya Distance

The following result refines (3) by using an approach which relies on moment inequalities [95]-[97]. The coverage in this section is self-contained.

Theorem 23: If $P \ll\gg Q$, then

$$D(P\|Q) \le \log\bigl(1 + \chi^2(P\|Q)\bigr) - \frac{\tfrac32 \bigl(\chi^2(P\|Q)\bigr)^2 \log e}{\bigl(1 + \chi^2(Q\|P)\bigr)\bigl(1 + \chi^2(P\|Q)\bigr)^2 - 1}. \tag{292}$$

Furthermore, if $\{P_n\}$ converges to $Q$ in the sense of (206), then the ratio of $D(P_n\|Q)$ and its upper bound in (292) tends to 1 as $n \to \infty$.


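Proof: The derivation of (292) relies on [95, Theorem 2.1] which states that if $W$ is a non-negative random variable, then

$$\lambda_\alpha = \begin{cases} \dfrac{\bigl(\mathbb{E}[W^\alpha] - \mathbb{E}^\alpha[W]\bigr) \log e}{\alpha (\alpha - 1)}, & \alpha \ne 0, 1 \\ \log\bigl(\mathbb{E}[W]\bigr) - \mathbb{E}[\log W], & \alpha = 0 \\ \mathbb{E}[W \log W] - \mathbb{E}[W] \log\bigl(\mathbb{E}[W]\bigr), & \alpha = 1 \end{cases} \tag{293}$$

is log-convex in $\alpha \in \mathbb{R}$.

To prove (292), let $W = \frac{dP}{dQ}(X)$ with $X \sim P$; then (293) yields

$$\lambda_0 = \log\bigl(1 + \chi^2(P\|Q)\bigr) - D(P\|Q), \tag{294}$$

$$\lambda_{-\alpha} = \frac{\bigl[1 + (\alpha - 1)\, \mathscr{H}_\alpha(Q\|P) - \bigl(1 + \chi^2(P\|Q)\bigr)^{-\alpha}\bigr] \log e}{\alpha (\alpha + 1)} \tag{295}$$

for all $\alpha > 0$, and specializing (295) yields

$$\lambda_{-1} = \frac{\chi^2(P\|Q) \log e}{2 \bigl(1 + \chi^2(P\|Q)\bigr)}, \tag{296}$$

$$\lambda_{-2} = \frac16 \Bigl[1 + \chi^2(Q\|P) - \bigl(1 + \chi^2(P\|Q)\bigr)^{-2}\Bigr] \log e. \tag{297}$$

In view of the log-convexity of $\lambda_\alpha$ in $\alpha \in \mathbb{R}$,

$$\lambda_0\, \lambda_{-2} \ge \lambda_{-1}^2 \tag{298}$$

which, by assembling (294)-(298), yields (292).

Suppose that $\{P_n\}$ converges to $Q$ in the sense of (206). Then, it follows from Theorem 12 and Corollary 3 that

$$\lim_{n \to \infty} D(P_n\|Q) = 0, \tag{299}$$

$$\lim_{n \to \infty} \chi^2(P_n\|Q) = 0, \tag{300}$$

$$\lim_{n \to \infty} \frac{D(P_n\|Q)}{\chi^2(P_n\|Q)} = \tfrac12 \log e, \tag{301}$$

$$\lim_{n \to \infty} \frac{\chi^2(Q\|P_n)}{\chi^2(P_n\|Q)} = 1. \tag{302}$$

Let $U_n$ denote the upper bound on $D(P_n\|Q)$ in (292). Assembling (299)-(302), it can be verified that

$$\lim_{n \to \infty} \frac{U_n}{D(P_n\|Q)} = 1. \tag{303}$$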


Remark 19: In view of (299)-(302), while the ratio of the right side of (292) with $P = P_n$ and $D(P_n\|Q)$ tends to 1, the ratio of the looser bound in (3) and $D(P_n\|Q)$ tends to 2.
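The asymptotic behavior described in Remark 19 can be observed numerically. The sketch below assumes the form of (292) with coefficient $\tfrac32$ in the numerator, takes (3) to be $D(P\|Q) \le \log(1 + \chi^2(P\|Q))$, and perturbs a hypothetical reference measure $Q$ by a small amount; all quantities are in nats.

```python
import math

def chi2(p, q):
    """chi^2(P||Q) on a finite alphabet."""
    return sum((a - b) ** 2 / b for a, b in zip(p, q))

def relative_entropy(p, q):
    """D(P||Q) in nats."""
    return sum(a * math.log(a / b) for a, b in zip(p, q) if a > 0)

def bound_292(p, q):
    """Right side of (292) in nats."""
    c, c_rev = chi2(p, q), chi2(q, p)
    return math.log1p(c) - 1.5 * c * c / ((1 + c_rev) * (1 + c) ** 2 - 1)

def ratios(eps, q=(0.5, 0.3, 0.2)):
    """Ratios (bound (292) / D) and (log(1 + chi^2) / D) for P = Q + eps*(1, -1, 0)."""
    p = (q[0] + eps, q[1] - eps, q[2])
    d = relative_entropy(p, q)
    return bound_292(p, q) / d, math.log1p(chi2(p, q)) / d

for eps in (0.1, 0.01, 0.001):
    print(eps, ratios(eps))
```

As the perturbation shrinks, the first ratio approaches 1 and the second approaches 2.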


Remark 20: If $\{P_n\}$ and $Q$ are defined on a finite set $\mathcal{A}$, then the condition in (206) is equivalent to $|P_n - Q| \to 0$ with $Q(a) > 0$ for all $a \in \mathcal{A}$.

Remark 21: An alternative refinement of (3) has been recently obtained in [97] as a function of $\chi^2(P\|Q)$ and the Bhattacharyya distance $B(P\|Q)$ (see Definition 3):

$$D(P\|Q) \le \log\bigl(1 + \chi^2(P\|Q)\bigr) - \frac{32 \Bigl[\exp\bigl(-B(P\|Q)\bigr) \sqrt{1 + \chi^2(P\|Q)} - 1\Bigr]^2 \log e}{9\, \chi^2(P\|Q)}. \tag{304}$$

Eq. (304) can be generalized by relying on the log-convexity of $\lambda_\alpha$ in $\alpha \in \mathbb{R}$, which yields

$$\lambda_0^{1-\alpha}\, \lambda_{-1}^{\alpha} \ge \lambda_{-\alpha} \tag{305}$$

for all $\alpha \in (0,1)$; consequently, assembling (294), (295), (296) and (305) yields

$$D(P\|Q) \le \log\bigl(1 + \chi^2(P\|Q)\bigr) - \bigl(\chi^2(P\|Q)\bigr)^{-\frac{\alpha}{1-\alpha}} \left[ \frac{2^\alpha \Bigl( \bigl(1 - (1-\alpha)\, \mathscr{H}_\alpha(Q\|P)\bigr) \bigl(1 + \chi^2(P\|Q)\bigr)^{\alpha} - 1 \Bigr)}{\alpha (\alpha + 1)} \right]^{\frac{1}{1-\alpha}} \log e \tag{306}$$

for all $\alpha \in (0,1)$. Note that in the special case $\alpha = \tfrac12$, (306) becomes (304), as can be readily verified in view of (83) and the symmetry property $\mathscr{H}_{1/2}(P\|Q) = \mathscr{H}_{1/2}(Q\|P)$.

Remark 22: The following lower bound on the relative entropy has been derived in [97], based on the approach of moment inequalities:¹

$$D(P\|Q) \ge 2 B(P\|Q) + \frac{6 \bigl[1 - \exp\bigl(-2 B(P\|Q)\bigr)\bigr]^2 \log e}{1 - \exp\bigl(-4 B(P\|Q)\bigr) + \chi^2(Q\|P)}. \tag{307}$$

Note that from (82)

$$B(P\|Q) \ge \tfrac12 \log \frac{1}{1 - \tfrac14 |P - Q|^2} \tag{308}$$

and since the right side of (307) is monotonically increasing in $B(P\|Q)$, the replacement of $B(P\|Q)$ in the right side of (307) with its lower bound in (308) yields

$$D(P\|Q) \ge \log \frac{1}{1 - \tfrac14 |P - Q|^2} + \frac{\tfrac38 |P - Q|^4 \log e}{1 - \bigl(1 - \tfrac14 |P - Q|^2\bigr)^2 + \chi^2(Q\|P)}. \tag{309}$$

Although (309) improves the bound in (4), it is weaker than (307), and it satisfies the tightness property in Theorem 23 only in special cases such as when $P$ and $Q$ are defined on $\mathcal{A} = \{0,1\}$ with $P(0) = Q(1) = \tfrac12 + \varepsilon$ and we let $\varepsilon \to 0$.

Define the binary relative entropy function as the continuous extension to $[0,1]^2$ of

$$d(x\|y) = x \log \frac{x}{y} + (1 - x) \log \frac{1 - x}{1 - y}. \tag{310}$$


The following result improves the upper bound in (181).

Theorem 24: Let $P \ll\gg Q$ with $(\beta_1, \beta_2) \in (0,1)^2$. Then,

a)
$$\chi^2(P\|Q) \le (\beta_1^{-1} - 1)(1 - \beta_2), \tag{311}$$
which is attainable for binary alphabets.

b)
$$D(P\|Q) \le \min \Biggl\{ \log(1 + c) - \frac{\tfrac32\, c \log e}{1 + (1 + \beta_2^{-1})(1 + c)}, \tag{312}$$
$$\frac{\bigl(\sqrt{c^2 + 8 \beta_2\, c \log_e(1 + c)} - c\bigr) \log e}{4 \beta_2}, \tag{313}$$
$$d\!\left( \frac{\beta_1^{-1} - 1}{\beta_1^{-1} \beta_2^{-1} - 1} \,\Big\|\, \frac{\beta_2^{-1} (\beta_1^{-1} - 1)}{\beta_1^{-1} \beta_2^{-1} - 1} \right) \Biggr\} \tag{314}$$
where we have abbreviated $c = \chi^2(P\|Q)$ for typographical convenience.

¹ For the derivation of (307) for a general alphabet, similarly to (97), set $W = \sqrt{\frac{dQ}{dP}}(X)$ in (293) with $X \sim P$, and use the inequality $\lambda_0 \lambda_4 \ge \lambda_2^2$ which follows from the log-convexity of $\lambda_\alpha$ in $\alpha$.

Proof: To prove (311), we first consider the case where $P, Q$ are defined on $\{0,1\}$ and $\frac{P(0)}{Q(0)} = \beta_2$, $\frac{P(1)}{Q(1)} = \beta_1^{-1}$. Straightforward calculation yields

$$P(0) = \frac{\beta_1^{-1} - 1}{\beta_1^{-1} \beta_2^{-1} - 1}, \qquad Q(0) = \beta_2^{-1} P(0) \tag{315}$$

and

$$\chi^2(P\|Q) = (\beta_1^{-1} - 1)(1 - \beta_2). \tag{316}$$

In the case of a general alphabet, consider the elementary bound with $a < 0 < b$: $\mathbb{E}[Z^2] \le -ab$, which holds for any $Z \in [a, b]$ with $\mathbb{E}[Z] = 0$, and follows simply by taking expectations of

$$Z^2 = -ab + Z(a + b) - (Z - a)(b - Z) \tag{317}$$

$$\le -ab + Z(a + b). \tag{318}$$

Since $\chi^2(P\|Q) = \mathbb{E}[Z^2]$, (311) follows by letting $a = \beta_2 - 1$, $b = \beta_1^{-1} - 1$ and

$$Z = \frac{dP}{dQ}(Y) - 1, \quad Y \sim Q. \tag{319}$$

To prove (312), note that it follows by combining (292) with the left side of the inequality

$$\beta_2\, \chi^2(Q\|P) \le \frac{\chi^2(P\|Q)}{1 + \chi^2(P\|Q)} \le \beta_1^{-1} \chi^2(Q\|P) \tag{320}$$

where (320) follows from Theorem 13 with $f(t) = \tfrac{1}{t} - 1$ for $t > 0$.

To prove (313), note that assembling (8) and (241) yields

$$D(P\|Q) \le \log\bigl(1 + \chi^2(P\|Q)\bigr) - \frac{2 \beta_2\, D^2(P\|Q)}{\chi^2(P\|Q) \log e} \tag{321}$$

and solving this quadratic inequality in $D(P\|Q)$, for fixed $\chi^2(P\|Q)$, yields the bound in (313).

Bound (314) holds since the maximal $D(P\|Q)$, for fixed $\chi^2(P\|Q)$, is monotonically increasing in $\chi^2(P\|Q)$. In view of (311), $D(P\|Q)$ cannot be larger than its maximal value when $\chi^2(P\|Q) = (\beta_1^{-1} - 1)(1 - \beta_2)$. In the latter case, the condition of equality in (318) (recall that $\mathbb{E}[Z] = 0$) is

$$\mathbb{P}[Z = a] = \frac{b}{b - a} = 1 - \mathbb{P}[Z = b] \tag{322}$$

which implies that the maximal relative entropy $D(P\|Q)$ over all $P \ll\gg Q$ with given $(\beta_1, \beta_2) \in (0,1)^2$ is equal to

$$\mathbb{E}\bigl[(1 + Z) \log(1 + Z)\bigr] \tag{323}$$

$$= \frac{b (1 + a) \log(1 + a) - a (1 + b) \log(1 + b)}{b - a} \tag{324}$$

$$= d\!\left( \frac{\beta_1^{-1} - 1}{\beta_1^{-1} \beta_2^{-1} - 1} \,\Big\|\, \frac{\beta_2^{-1} (\beta_1^{-1} - 1)}{\beta_1^{-1} \beta_2^{-1} - 1} \right) \tag{325}$$

where (323)-(325) follow from (310) and (322) with $a = \beta_2 - 1$ and $b = \beta_1^{-1} - 1$. ■
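The bounds of Theorem 24 can be evaluated on a toy example. The sketch below is written in nats and assumes the forms of (311)-(314) as they are read here, namely $\tfrac32 c / (1 + (1 + \beta_2^{-1})(1 + c))$ for the subtracted term in (312) and denominator $4\beta_2$ in (313); the probability vectors and function names are hypothetical.

```python
import math

def binary_kl(x, y):
    """Binary relative entropy d(x||y) of (310), in nats, for x, y in (0, 1)."""
    return x * math.log(x / y) + (1 - x) * math.log((1 - x) / (1 - y))

def theorem24_bounds(c, beta1, beta2):
    """The three candidate bounds (312)-(314), in nats, for c = chi^2(P||Q)."""
    b312 = math.log1p(c) - 1.5 * c / (1 + (1 + 1 / beta2) * (1 + c))
    b313 = (math.sqrt(c * c + 8 * beta2 * c * math.log1p(c)) - c) / (4 * beta2)
    p0 = (1 / beta1 - 1) / (1 / (beta1 * beta2) - 1)
    b314 = binary_kl(p0, p0 / beta2)
    return b312, b313, b314

p = [0.4, 0.4, 0.2]  # hypothetical P
q = [0.2, 0.3, 0.5]  # hypothetical Q

likelihood_ratios = [a / b for a, b in zip(p, q)]
beta1 = 1 / max(likelihood_ratios)  # beta_1^{-1} is the largest value of dP/dQ
beta2 = min(likelihood_ratios)      # beta_2 is the smallest value of dP/dQ

c = sum((a - b) ** 2 / b for a, b in zip(p, q))    # chi^2(P||Q)
d = sum(a * math.log(a / b) for a, b in zip(p, q))  # D(P||Q)
chi2_cap = (1 / beta1 - 1) * (1 - beta2)            # right side of (311)
bounds = theorem24_bounds(c, beta1, beta2)
print(c, chi2_cap, d, bounds)
```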

Remark 23: The proof of (312) relies on the left side of (320); this strengthens the bound which follows from Theorem 6 given by $\chi^2(Q\|P) \le \beta_2^{-1} \chi^2(P\|Q)$. The bound (313) is typically of similar tightness as the bound in (312), although neither of them outperforms the other for all $(\beta_1, \beta_2) \in (0,1)^2$ and $\chi^2(P\|Q) \in \bigl[0, (\beta_1^{-1} - 1)(1 - \beta_2)\bigr]$ (see (311)).

Remark 24: The left inequality in (181) and Theorem 24 provide an analytical outer bound on the locus of the points $(\chi^2(P\|Q), D(P\|Q))$ where $P \ll\gg Q$ and $\beta_2 \le \frac{dP}{dQ} \le \beta_1^{-1}$ for given $(\beta_1, \beta_2) \in (0,1)^2$.

Example 12: In continuation to Remark 24, for given $(\beta_1, \beta_2) \in (0,1)^2$, Figure 1 compares the locus of the points $(\chi^2(P\|Q), D(P\|Q))$ when $P, Q$ are restricted to binary alphabets, and $\frac{dP}{dQ}$ is bounded between $\beta_2$ and $\beta_1^{-1}$, with an outer bound constructed with the left inequality in (181) and Theorem 24 (recall that the outer bound is valid for an arbitrary alphabet).

Fig. 1. Comparison of the locus of the points $(\chi^2(P\|Q), D(P\|Q))$ when $P, Q$ are defined on $\mathcal{A} = \{0,1\}$ with the bounds in the left side of (181) (a) and Theorem 24 (b), for two choices of $(\beta_1, \beta_2)$ in the upper and lower plots, respectively. The dashed curve in each plot corresponds to the looser bound in (3). Axes: $\chi^2(P\|Q)$ (horizontal) and $D(P\|Q)$ in nats (vertical).

The following result relies on the earlier analysis to provide bounds on the Bhattacharyya distance, expressed in terms of $\chi^2$ divergences and relative entropy.

Theorem 25: If $P \ll\gg Q$, then the following bounds on the Bhattacharyya distance hold:

$$\tfrac12 \log\bigl(1 + \chi^2(P\|Q)\bigr) - \log\left(1 + \tfrac34 \sqrt{\frac{\chi^2(P\|Q)}{2 \log e} \Bigl[\log\bigl(1 + \chi^2(P\|Q)\bigr) - D(P\|Q)\Bigr]}\right) \le B(P\|Q) \tag{326}$$

$$B(P\|Q) \le \tfrac12 \log\bigl(1 + \chi^2(P\|Q)\bigr) - \log\left(1 + \sqrt{\frac{\bigl(\tfrac34 \chi^2(P\|Q)\bigr)^3}{\bigl(1 + \chi^2(P\|Q)\bigr)^2 \bigl(1 + \chi^2(Q\|P)\bigr) - 1}}\right). \tag{327}$$

Furthermore, if $\{P_n\}$ converges to $Q$ in the sense of (206), then the ratio of the bounds on $B(P_n\|Q)$ in (326) and (327) tends to 1 as $n \to \infty$.

Proof: In view of the log-convexity of $\lambda_\alpha$ in $\alpha \in \mathbb{R}$,

$$\lambda_0\, \lambda_{-1} \ge \lambda_{-1/2}^2, \qquad \lambda_{-1/2}^2\, \lambda_{-2} \ge \lambda_{-1}^3 \tag{328}$$

for any choice of the random variable $W$ in (293). Consequently, assembling (294), (295) and (328) yields the bounds on $B(P\|Q)$ in (326) and (327).

Suppose that $\{P_n\}$ converges to $Q$ in the sense of (206). Let $L_n$ and $U_n$ denote, respectively, the lower and upper bounds on $B(P_n\|Q)$ in (326) and (327). Assembling (299)-(302), it can be easily verified that

$$\lim_{n \to \infty} \frac{L_n}{\chi^2(P_n\|Q)} = \lim_{n \to \infty} \frac{U_n}{\chi^2(P_n\|Q)} = \tfrac18 \log e \tag{329}$$

which yields that $\lim_{n \to \infty} \frac{U_n}{L_n} = 1$.
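The two-sided bound on the Bhattacharyya distance can be illustrated numerically. The sketch below, in nats, assumes the forms of (326) and (327) obtained from the log-convexity inequalities in (328) with $\lambda_{-1/2} = \tfrac43 \bigl(\exp(-B) - (1 + \chi^2(P\|Q))^{-1/2}\bigr) \log e$; the probability vectors are hypothetical.

```python
import math

def chi2(p, q):
    """chi^2(P||Q) on a finite alphabet."""
    return sum((a - b) ** 2 / b for a, b in zip(p, q))

def bhattacharyya_bounds(p, q):
    """Lower bound (326) and upper bound (327) on B(P||Q), in nats."""
    c, c_rev = chi2(p, q), chi2(q, p)
    d = sum(a * math.log(a / b) for a, b in zip(p, q) if a > 0)
    base = 0.5 * math.log1p(c)
    lower = base - math.log1p(0.75 * math.sqrt(0.5 * c * (math.log1p(c) - d)))
    upper = base - math.log1p(
        math.sqrt((0.75 * c) ** 3 / ((1 + c) ** 2 * (1 + c_rev) - 1))
    )
    return lower, upper

p = [0.4, 0.4, 0.2]  # hypothetical P
q = [0.2, 0.3, 0.5]  # hypothetical Q

# B(P||Q) = -log of the Bhattacharyya coefficient sum_a sqrt(P(a) Q(a))
b = -math.log(sum(math.sqrt(a * x) for a, x in zip(p, q)))
lower, upper = bhattacharyya_bounds(p, q)
print(lower, b, upper)
```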


Remark 25: Note that (327) refines the bound

$$B(P\|Q) \le \tfrac12 \log\bigl(1 + \chi^2(P\|Q)\bigr) \tag{330}$$

which is equivalent to $\lambda_{-1/2} \ge 0$ (in view of Jensen's inequality, (83) and (295)).

Remark 26: Let $\{P_n\}$ converge to $Q$ in the sense of (206). In view of (329), it follows that

$$\lim_{n \to \infty} \frac{B(P_n\|Q)}{\chi^2(P_n\|Q)} = \tfrac18 \log e, \tag{331}$$

from which we can surmise that both upper bounds in (292) and (304) are tight under the condition in (206) (see Theorem 23), although (292) only depends on $\chi^2$-divergences. In view of (331), the lower bound in (307) is also tight under the condition in (206), in the sense that the ratio of $D(P_n\|Q)$ and its lower bound in (307) tends to 1 as $n \to \infty$; this sufficient condition for the tightness of (307) strengthens the result in [97, Section 4].


VI. Reverse Pinsker Inequalities

It is not possible to lower bound $|P - Q|$ solely in terms of $D(P\|Q)$ since for any arbitrarily small $\epsilon > 0$ and arbitrarily large $\lambda > 0$, we can construct examples with $|P - Q| < \epsilon$ and $\lambda < D(P\|Q) < \infty$. Therefore, each of the bounds in this section involves not only $D(P\|Q)$ but another feature of the pair $(P, Q)$.


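A. Bounded Relative Information

As in Section IV, the following result involves the bounds on the relative information.

Theorem 26: If $\beta_1 \in (0,1)$ and $\beta_2 \in [0,1)$, then,

$$D(P\|Q) \le \tfrac12 \bigl(\varphi(\beta_1^{-1}) - \varphi(\beta_2)\bigr)\, |P - Q| \tag{332}$$

where $\varphi \colon [0, \infty) \to [0, \infty)$ is given by

$$\varphi(t) = \begin{cases} 0, & t = 0 \\ \dfrac{t \log t}{t - 1}, & t \in (0,1) \cup (1, \infty) \\ \log e, & t = 1. \end{cases} \tag{333}$$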


Proof: Let $X \sim P$, $Y \sim Q$, and $Z$ be defined in (32). The function $\varphi \colon [0, \infty) \to [0, \infty)$ is continuous, monotonically increasing and non-negative; the monotonicity property holds since $(t - 1)^2 \varphi'(t) = (t - 1) \log e - \log t \ge 0$ for all $t > 0$, and its non-negativity follows from the fact that $\varphi$ is monotonically increasing on $[0, \infty)$ and $\varphi(0) = 0$. Accordingly,

$$\varphi(\beta_2) \le \varphi(Z) \le \varphi(\beta_1^{-1}) \tag{334}$$

since (32) and (151)-(152) imply that $Z \in [\beta_2, \beta_1^{-1}]$ with probability one. The relative entropy satisfies

$$D(P\|Q) = \mathbb{E}[Z \log Z] \tag{335}$$

$$= \mathbb{E}[\varphi(Z) (Z - 1)] \tag{336}$$

$$= \mathbb{E}\bigl[\varphi(Z) (Z - 1)\, 1\{Z > 1\}\bigr] + \mathbb{E}\bigl[\varphi(Z) (Z - 1)\, 1\{Z < 1\}\bigr]. \tag{337}$$

We bound each of the summands in the right side of (337) separately. Invoking (334), we have

$$\mathbb{E}\bigl[\varphi(Z) (Z - 1)\, 1\{Z > 1\}\bigr] \le \varphi(\beta_1^{-1})\, \mathbb{E}\bigl[(Z - 1)\, 1\{Z > 1\}\bigr] \tag{338}$$

$$= \varphi(\beta_1^{-1})\, \mathbb{E}\bigl[(1 - Z)^{-}\bigr] \tag{339}$$

$$= \tfrac12 \varphi(\beta_1^{-1})\, |P - Q| \tag{340}$$

where (339) holds since $(x)^{-} = -x\, 1\{x < 0\}$, and (340) follows from (245) with $Z$ in (32). Similarly, (334) yields

$$\mathbb{E}\bigl[\varphi(Z) (Z - 1)\, 1\{Z < 1\}\bigr] \le \varphi(\beta_2)\, \mathbb{E}\bigl[(Z - 1)\, 1\{Z < 1\}\bigr] \tag{341}$$

$$= -\varphi(\beta_2)\, \mathbb{E}\bigl[(1 - Z)^{+}\bigr] \tag{342}$$

$$= -\tfrac12 \varphi(\beta_2)\, |P - Q| \tag{343}$$

where (343) follows from (244). Assembling (337), (340) and (343), we obtain (332).
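The reverse Pinsker inequality (332) can be evaluated on a finite alphabet, where $\beta_1^{-1}$ and $\beta_2$ are simply the largest and smallest values of the likelihood ratio. The sketch below is in nats and uses hypothetical probability vectors; it also evaluates the weaker bound obtained by dropping the term $\varphi(\beta_2)$.

```python
import math

def phi(t):
    """The function in (333), in nats: t log(t) / (t - 1), extended to t = 0 and t = 1."""
    if t == 0:
        return 0.0
    if t == 1:
        return 1.0
    return t * math.log(t) / (t - 1)

p = [0.4, 0.4, 0.2]  # hypothetical P
q = [0.2, 0.3, 0.5]  # hypothetical Q

likelihood_ratios = [a / b for a, b in zip(p, q)]
inv_beta1 = max(likelihood_ratios)  # beta_1^{-1}: largest value of dP/dQ
beta2 = min(likelihood_ratios)      # beta_2: smallest value of dP/dQ

tv = sum(abs(a - b) for a, b in zip(p, q))
d = sum(a * math.log(a / b) for a, b in zip(p, q))

bound_332 = 0.5 * (phi(inv_beta1) - phi(beta2)) * tv
bound_without_phi_beta2 = 0.5 * phi(inv_beta1) * tv  # negative term dropped
print(d, bound_332, bound_without_phi_beta2)
```

When the likelihood ratio only takes values in $\{\beta_2, 1, \beta_1^{-1}\}$, the proof shows that (332) holds with equality, so strict slack requires an intermediate value as in this example.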

Remark 27: By dropping the negative term in (332), we can get the weaker version in [109, Theorem 7]:

$$D(P\|Q) \le \frac{\log \frac{1}{\beta_1}}{2 (1 - \beta_1)}\, |P - Q|. \tag{344}$$

The coefficient of $|P - Q|$ in the right side of (344) is monotonically decreasing in $\beta_1$ and it tends to $\tfrac12 \log e$ by letting $\beta_1 \to 1$. The improvement over (344) afforded by (332) is exemplified in Appendix E. The bound in (344) has been recently used in the context of the optimal quantization of probability measures [12, Proposition 4].
Remark 28: The proof of Theorem 26 hinges on the fact that the function $\varphi$ is monotonically increasing. It can be verified that $\varphi$ is also concave and differentiable. Taking into account these additional properties of $\varphi$, the bound in Theorem 26 can be tightened as (see Appendix F):

$$D(P\|Q) \le \tfrac12 \bigl(\varphi(\beta_1^{-1}) - \varphi(\beta_2) - \varphi'(\beta_1^{-1})\, \beta_1^{-1}\bigr)\, |P - Q| + \varphi'(\beta_1^{-1})\, \mathbb{E}\bigl[Z (Z - 1)\, 1\{Z > 1\}\bigr] \tag{345}$$

which is expressed in terms of the distribution of the relative information. The second summand in the right side of (345) satisfies

$$\chi^2(P\|Q) + \tfrac12 \beta_2\, |P - Q| \le \mathbb{E}\bigl[Z (Z - 1)\, 1\{Z > 1\}\bigr] \tag{346}$$

$$\le \chi^2(P\|Q) + \tfrac12 |P - Q|. \tag{347}$$

From (32), (245) and $Z \in [\beta_2, \beta_1^{-1}]$, the gap between the upper and lower bounds in (347) satisfies

$$\tfrac12 (1 - \beta_2)\, |P - Q| = (1 - \beta_2)\, \mathbb{E}\bigl[(1 - Z)^{+}\bigr] \tag{348}$$

$$\le (1 - \beta_2)^2 \tag{349}$$

which is upper bounded by 1, and it is close to zero if $\beta_2 \approx 1$. The combination of (345) and (347) leads to

$$D(P\|Q) \le \tfrac12 \bigl(\varphi(\beta_1^{-1}) - \varphi(\beta_2) + \varphi'(\beta_1^{-1}) (1 - \beta_1^{-1})\bigr)\, |P - Q| + \varphi'(\beta_1^{-1}) \cdot \chi^2(P\|Q). \tag{350}$$


Remark 29: A special case of (344), where $Q$ is the uniform distribution over a set of a finite size, was recently rediscovered in [58, Corollary 13] based on results from [52].

Remark 30: For $\varepsilon \in [0, 2]$ and a fixed probability measure $Q$, define

$$D^*(\varepsilon, Q) = \inf_{P \colon |P - Q| \ge \varepsilon} D(P\|Q). \tag{351}$$

From Sanov's theorem (see [20, Theorem 11.4.1]), $D^*(\varepsilon, Q)$ is equal to the asymptotic exponential decay of the probability that the total variation distance between the empirical distribution of a sequence of i.i.d. random variables and the true distribution $Q$ is more than a specified value $\varepsilon$. Bounds on $D^*(\varepsilon, Q)$ have been shown in [10, Theorem 1], which, locally, behave quadratically in $\varepsilon$. Although this result was classified in [10] as a reverse Pinsker inequality, note that it differs from the scope of this section which provides, under suitable conditions, lower bounds on the total variation distance as a function of the relative entropy.


B. Lipschitz Constraints

Definition 6: A function $f \colon \mathcal{B} \to \mathbb{R}$, where $\mathcal{B} \subseteq \mathbb{R}$, is $L$-Lipschitz if for all $x, y \in \mathcal{B}$

$$|f(x) - f(y)| \le L\, |x - y|. \tag{352}$$

The following bound generalizes [35, Theorem 6] to the non-discrete setting.

Theorem 27: Let $P \ll Q$ with $\beta_1 \in (0,1)$ and $\beta_2 \in [0,1)$, and $f \colon [0, \infty) \to \mathbb{R}$ be continuous and convex with $f(1) = 0$, and $L$-Lipschitz on $[\beta_2, \beta_1^{-1}]$. Then,

$$D_f(P\|Q) \le L\, |P - Q|. \tag{353}$$

Proof: If $Y \sim Q$, and $Z$ is given by (32), then $f(1) = 0$ yields

$$D_f(P\|Q) = \mathbb{E}[f(Z)] \tag{354}$$

$$\le \mathbb{E}\bigl[|f(Z) - f(1)|\bigr] \tag{355}$$

$$\le L\, \mathbb{E}\bigl[|Z - 1|\bigr] \tag{356}$$

$$= L\, |P - Q| \tag{357}$$

where (357) holds due to (243).

Note that if $f$ has a bounded derivative on $[\beta_2, \beta_1^{-1}]$, we can choose

$$L = \sup_{t \in [\beta_2, \beta_1^{-1}]} |f'(t)| < \infty. \tag{358}$$

Remark 31: In the case $f(t) = t \log t$, $f(0) = 0$, (358) particularizes to

$$L = \max\bigl\{ |\log(e \beta_2)|,\; \log(e \beta_1^{-1}) \bigr\} \tag{359}$$

resulting in a reverse Pinsker inequality which is weaker than that in (332) by at least a factor of 2.

C. Finite Alphabet

Throughout this subsection, we assume that $P$ and $Q$ are probability measures defined on a common finite set $\mathcal{A}$, and $Q$ is strictly positive on $\mathcal{A}$, which has more than one element.

The bound in (344) strengthens the finite-alphabet bound in [38, Lemma 3.10]:

$$D(P\|Q) \le \log\left(\frac{1}{Q_{\min}}\right) \cdot |P - Q| \tag{360}$$

with

$$Q_{\min} = \min_{a \in \mathcal{A}} Q(a) \le \tfrac12. \tag{361}$$

To verify this, notice that $\beta_1 \ge Q_{\min}$. Let $v \colon (0,1) \to (0, \infty)$ be defined by $v(t) = \frac{1}{1 - t} \log \frac{1}{t}$; since $v$ is a monotonically decreasing and non-negative function, we can weaken (344) to write

$$D(P\|Q) \le \frac{\log \frac{1}{Q_{\min}}}{2 (1 - Q_{\min})}\, |P - Q| \tag{362}$$

$$\le \log\left(\frac{1}{Q_{\min}}\right) \cdot |P - Q| \tag{363}$$

where (363) follows from (361).


The main result in this subsection is the following bound.

Theorem 28:

$$D(P\|Q) \le \log\left(1 + \frac{|P - Q|^2}{2 Q_{\min}}\right). \tag{364}$$

Furthermore, if $Q \ll P$ and $\beta_2$ is defined as in (150), then the following tightened bound holds:

$$D(P\|Q) \le \log\left(1 + \frac{|P - Q|^2}{2 Q_{\min}}\right) - \frac{\beta_2 \log e}{2}\, |P - Q|^2. \tag{365}$$

Proof: Combining (5) and the following finite-alphabet upper bound on $\chi^2(P\|Q)$ yields (364):

$$Q_{\min}\, \chi^2(P\|Q) = \sum_{a \in \mathcal{A}} \frac{\bigl(P(a) - Q(a)\bigr)^2}{Q(a) / Q_{\min}} \tag{366}$$

$$\le \sum_{a \in \mathcal{A}} \bigl(P(a) - Q(a)\bigr)^2 \tag{367}$$

$$\le \max_{x \in \mathcal{A}} |P(x) - Q(x)| \sum_{a \in \mathcal{A}} |P(a) - Q(a)| \tag{368}$$

$$= |P - Q| \max_{a \in \mathcal{A}} |P(a) - Q(a)|$$

$$\le \tfrac12 |P - Q|^2. \tag{369}$$

If $P \ll\gg Q$, then (365) follows by combining (369) and

$$\chi^2(P\|Q) \ge \exp\bigl(D(P\|Q) + \beta_2\, D(Q\|P)\bigr) - 1 \tag{370}$$

$$\ge \exp\bigl(D(P\|Q) + \tfrac12 |P - Q|^2\, \beta_2 \log e\bigr) - 1 \tag{371}$$

where (370) follows by rearranging (241), and (371) follows from (1). ■

Remark 32: It is easy to check that Theorem 28 strengthens the bound by Csiszár and Talata (23) by at least a factor of 2 since upper bounding the logarithm in (364) gives

$$D(P\|Q) \le \frac{\log e}{2 Q_{\min}}\, |P - Q|^2. \tag{372}$$
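The finite-alphabet bounds (364), (365) and (372) can be compared on a toy example. The sketch below is in nats and uses hypothetical probability vectors; $\beta_2$ is taken to be the smallest value of the likelihood ratio $P(a)/Q(a)$.

```python
import math

p = [0.4, 0.4, 0.2]  # hypothetical P
q = [0.2, 0.3, 0.5]  # hypothetical Q (strictly positive)

tv = sum(abs(a - b) for a, b in zip(p, q))          # |P - Q|
d = sum(a * math.log(a / b) for a, b in zip(p, q))   # D(P||Q) in nats
q_min = min(q)
beta2 = min(a / b for a, b in zip(p, q))

bound_364 = math.log1p(tv ** 2 / (2 * q_min))
bound_365 = bound_364 - 0.5 * beta2 * tv ** 2   # tightened version, needs Q << P
bound_372 = tv ** 2 / (2 * q_min)               # from log(1 + x) <= x in nats
print(d, bound_365, bound_364, bound_372)
```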


Remark 33: In the finite-alphabet case, similarly to (364), one can obtain another upper bound on $D(P\|Q)$ as a function of the $\ell_2$ norm $\|P - Q\|_2$:

$$D(P\|Q) \le \frac{\|P - Q\|_2^2 \log e}{Q_{\min}} \tag{373}$$

which appears in the proof of Property 4 of [100, Lemma 7], and also used in 071 (174)]. Furthermore, similarly to (365), the following tightened bound holds if $P \ll\gg Q$:

$$D(P\|Q) \le \log\left(1 + \frac{\|P - Q\|_2^2}{Q_{\min}}\right) - \frac{\beta_2 \log e}{2}\, \|P - Q\|_2^2 \tag{374}$$

which follows by combining (367), (371), and the inequality $\|P - Q\|_2 \le |P - Q|$.
Remark 34: Combining (1) and (369) yields that if $P \ne Q$ are defined on a common finite set, then

$$\frac{D(P\|Q)}{\chi^2(P\|Q)} \ge Q_{\min} \log e \tag{375}$$

which at least doubles the lower bound in [72, Lemma 6]. This, in turn, improves the tightened upper bound on the strong data processing inequality constant in [72, Theorem 10] by a factor of 2.

Remark 35: Reverse Pinsker inequalities have also been derived in quantum information theory (0, 01), providing upper bounds on the relative entropy of two quantum states as a function of the trace norm distance when the minimal eigenvalues of the states are positive (cf. [|3] Theorem 6] and 0 Theorem 1]). When the variational distance is much smaller than the minimal eigenvalue (see [3, Eq. (57)]), the latter bounds have a quadratic scaling in this distance, similarly to (364); they are also inversely proportional to the minimal eigenvalue, similarly to the dependence of (364) on $Q_{\min}$.

Remark 36: Let $P$ and $Q$ be probability distributions defined on an arbitrary alphabet $\mathcal{A}$. Combining Theorems 7 and 26 leads to a derivation of an upper bound on the difference $D(P\|Q) - D(Q\|P)$ as a function of $|P - Q|$ as long as $P \ll\gg Q$ and the relative information $\imath_{P\|Q}$ is bounded away from $-\infty$ and $+\infty$. Furthermore, another upper bound on the difference of the relative entropies can be readily obtained by combining Theorems 7 and 28 when $P$ and $Q$ are probability measures defined on a finite alphabet. In the latter case, combining Theorem 7 and (374) also yields another upper bound on $D(P\|Q) - D(Q\|P)$ which scales quadratically with $\|P - Q\|_2$. All these bounds form a counterpart to [5, Theorem 1] and Theorem 7, providing measures of the asymmetry of the relative entropy when the relative information is bounded.


D. Distance From the Equiprobable Distribution

If $P$ is a distribution on a finite set $\mathcal{A}$, $H(P)$ gauges the "distance" from $U$, the equiprobable distribution defined on $\mathcal{A}$, since $H(P) = \log |\mathcal{A}| - D(P\|U)$. Thus, it is of interest to explore the relationship between $H(P)$ and $|P - U|$. Next, we determine the exact locus of the points $(H(P), |P - U|)$ among all probability measures $P$ defined on $\mathcal{A}$, and this region is compared to upper and lower bounds on $|P - U|$ as a function of $H(P)$. As usual, $h(x)$ denotes the continuous extension of $-x \log x - (1 - x) \log(1 - x)$ to $x \in [0,1]$ and $d(x\|y)$ denotes the binary relative entropy in (310).

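Theorem 29: Let $U$ be the equiprobable distribution on $\mathcal{A} = \{1, \ldots, |\mathcal{A}|\}$, with $1 < |\mathcal{A}| < \infty$.

a) For $\Delta \in \bigl(0, 2(1 - |\mathcal{A}|^{-1})\bigr]$,²

$$\max_{P \colon |P - U| = \Delta} H(P) = \log |\mathcal{A}| - \min_{m}\, d\!\left( \frac{m}{|\mathcal{A}|} + \tfrac12 \Delta \,\Big\|\, \frac{m}{|\mathcal{A}|} \right) \tag{376}$$

where the minimum in the right side of (376) is over

$$m \in \bigl\{1, \ldots, |\mathcal{A}| - \lceil \tfrac12 \Delta |\mathcal{A}| \rceil \bigr\}. \tag{377}$$

Denoting such an integer by $m_\Delta$, the maximum in the left side of (376) is attained by

$$P_\Delta(\ell) = \begin{cases} |\mathcal{A}|^{-1} + \dfrac{\Delta}{2 m_\Delta}, & \ell \in \{1, \ldots, m_\Delta\}, \\ |\mathcal{A}|^{-1} - \dfrac{\Delta}{2 (|\mathcal{A}| - m_\Delta)}, & \ell \in \{m_\Delta + 1, \ldots, |\mathcal{A}|\}. \end{cases} \tag{378}$$

² There is no $P$ with $|P - U| > 2(1 - |\mathcal{A}|^{-1})$.

b) Let

$$h_k = \begin{cases} 0, & k = 0 \\ h\bigl(|\mathcal{A}|^{-1} k\bigr) + |\mathcal{A}|^{-1} k \log k, & k \in \{1, \ldots, |\mathcal{A}| - 2\} \\ \log |\mathcal{A}|, & k = |\mathcal{A}| - 1. \end{cases} \tag{379}$$

If $H \in [h_{k-1}, h_k)$ for $k \in \{1, \ldots, |\mathcal{A}| - 1\}$, then

$$\min_{P \colon H(P) = H} |P - U| = 2 \bigl(1 - (k + \theta)\, |\mathcal{A}|^{-1}\bigr) \tag{380}$$

which is achieved by

$$P_\theta^{(k)}(\ell) = \begin{cases} 1 - (k - 1 + \theta)\, |\mathcal{A}|^{-1}, & \ell = 1 \\ |\mathcal{A}|^{-1}, & \ell \in \{2, \ldots, k\}, \\ \theta\, |\mathcal{A}|^{-1}, & \ell = k + 1 \\ 0, & \ell \in \{k + 2, \ldots, |\mathcal{A}|\} \end{cases} \tag{381}$$

where $\theta \in [0,1)$ is chosen so that $H(P_\theta^{(k)}) = H$.

Proof: See Appendix G. ■

Remark 37: For probability measures defined on a 2-element set $\mathcal{A}$, the maximal and minimal values of $|P - U|$ in Theorem 29 coincide. This can be verified since, if $P(1) = p$ for $p \in [0,1]$, then $|P - U| = |1 - 2p|$ and $H(P) = h(p)$. Hence, if $|\mathcal{A}| = 2$ and $H(P) = H \in [0, \log 2]$, then

$$|P - U| = 1 - 2 h^{-1}(H) \tag{382}$$

where $h^{-1} \colon [0, \log 2] \to [0, \tfrac12]$ denotes the inverse of the binary entropy function.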
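Part a) of Theorem 29 lends itself to a direct numerical check. The sketch below, in nats, evaluates the right side of (376) by searching over the admissible integers in (377), builds the maximizing distribution of (378), and confirms that its entropy matches; the function names and the chosen alphabet size are illustrative only.

```python
import math

def entropy(p):
    """H(P) in nats."""
    return -sum(a * math.log(a) for a in p if a > 0)

def binary_kl(x, y):
    """d(x||y) in nats for x in [0, 1] and y in (0, 1), with 0 log 0 = 0."""
    total = 0.0
    if x > 0:
        total += x * math.log(x / y)
    if x < 1:
        total += (1 - x) * math.log((1 - x) / (1 - y))
    return total

def max_entropy_given_tv(n, delta):
    """Right side of (376) and the maximizer (378) for |A| = n and |P - U| = delta.

    The ceiling in (377) is evaluated in floating point, so values of delta for
    which delta * n / 2 is an exact integer may need extra care.
    """
    m_max = n - math.ceil(0.5 * delta * n)
    m_best = min(range(1, m_max + 1),
                 key=lambda m: binary_kl(m / n + 0.5 * delta, m / n))
    value = math.log(n) - binary_kl(m_best / n + 0.5 * delta, m_best / n)
    p = ([1 / n + delta / (2 * m_best)] * m_best
         + [1 / n - delta / (2 * (n - m_best))] * (n - m_best))
    return value, p

n, delta = 4, 0.4
value, p_star = max_entropy_given_tv(n, delta)
tv = sum(abs(a - 1 / n) for a in p_star)
print(value, entropy(p_star), tv)
```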

Results on the more general problem of finding bounds on $|H(P) - H(Q)|$ based on $|P - Q|$ can be found in [20, Theorem 17.3.3], [52], [90] and [113]. Most well-known among them is

$$|H(P) - H(Q)| \le |P - Q| \log\left(\frac{|\mathcal{A}|}{|P - Q|}\right) \tag{383}$$

which holds if $P, Q$ are probability measures defined on a finite set $\mathcal{A}$ with $|P(a) - Q(a)| \le \tfrac12$ for all $a \in \mathcal{A}$ (see [110], and [27, Lemma 2.7] with a stronger sufficient condition). Particularizing (383) to the case where $Q = U$, and $\bigl|P(a) - \tfrac{1}{|\mathcal{A}|}\bigr| \le \tfrac12$ for all $a \in \mathcal{A}$ yields

$$H(P) \ge \log |\mathcal{A}| - |P - U| \log\left(\frac{|\mathcal{A}|}{|P - U|}\right), \tag{384}$$

a bound which finds use in information-theoretic security. Particularizing (1), (4), and (364) we obtain

$$H(P) \le \log |\mathcal{A}| - \tfrac12 |P - U|^2 \log e, \tag{385}$$

$$H(P) \le \log |\mathcal{A}| + \log\bigl(1 - \tfrac14 |P - U|^2\bigr), \tag{386}$$

$$H(P) \ge \log |\mathcal{A}| - \log\bigl(1 + \tfrac12 |\mathcal{A}|\, |P - U|^2\bigr). \tag{387}$$

If either $|\mathcal{A}| = 2$ or $8 \le |\mathcal{A}| \le 102$, it can be checked that the lower bound on $H(P)$ in (384) is worse than (387), irrespective of $|P - U|$ (note that $0 \le |P - U| \le 2(1 - |\mathcal{A}|^{-1})$).

The exact locus of $(H(P), |P - U|)$ among all the probability measures $P$ defined on a finite set $\mathcal{A}$ (see Theorem 29), and the bounds in (385)-(387) are illustrated in Figure 2 for $|\mathcal{A}| = 4$ and $|\mathcal{A}| = 256$. For $|\mathcal{A}| = 4$, the lower bound in (387) is tighter than (384). For $|\mathcal{A}| = 256$, we only show (387) in Figure 2 as in this case (384) offers a very minor improvement in a small range. As the cardinality of the set $\mathcal{A}$ increases, the gap between the exact locus (shaded region) and the upper bound obtained from (385) and (386) (Curves (a) and (b), respectively) decreases, whereas the gap between the exact locus and the lower bound in (387) (Curve (c)) increases.


E. The Exponential Decay of the Probability of Non-Strongly Typical Sequences

The objective is to bound the function

$$L_\delta(Q) = \min_{P \notin \mathcal{T}_\delta(Q)} D(P\|Q) \tag{388}$$

where the subset of probability measures on $(\mathcal{A}, \mathscr{F})$ which are $\delta$-close to $Q$ is given by

$$\mathcal{T}_\delta(Q) = \bigl\{ P \colon \forall a \in \mathcal{A},\; |P(a) - Q(a)| \le \delta\, Q(a) \bigr\}. \tag{389}$$

Note that $(a_1, \ldots, a_n)$ is strongly $\delta$-typical according to $Q$ if its empirical distribution belongs to $\mathcal{T}_\delta(Q)$. According to Sanov's theorem (e.g. [20, Theorem 11.4.1]), if the random variables are independent and distributed according to $Q$, then the probability that $(Y_1, \ldots, Y_n)$ is not $\delta$-typical vanishes exponentially with exponent $L_\delta(Q)$.

Fig. 2. The exact locus of $(H(P), |P - U|)$ among all the probability measures $P$ defined on a finite set $\mathcal{A}$, and bounds on $|P - U|$ as a function of $H(P)$ (in bits) for $|\mathcal{A}| = 4$ (left plot), and $|\mathcal{A}| = 256$ (right plot). The point $(H(P), |P - U|) = (0, 2(1 - |\mathcal{A}|^{-1}))$ is depicted on the y-axis. In the two plots, Curves (a), (b) and (c) refer, respectively, to (385), (386) and (387); the exact locus (shaded region) refers to Theorem 29.

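To state the next result, we invoke the following notions from [76]. Given a probability measure $Q$, its balance coefficient is given by

$$\beta_Q = \inf_{\mathcal{F} \in \mathscr{F} \colon Q(\mathcal{F}) \ge \frac12} Q(\mathcal{F}). \tag{390}$$

The function $\phi \colon (0, \tfrac12] \to [\tfrac12 \log e, \infty)$ is a monotonically decreasing and convex function, which is given by

$$\phi(p) = \begin{cases} \dfrac{1}{4 (1 - 2p)} \log\left(\dfrac{1 - p}{p}\right), & p \in (0, \tfrac12), \\ \tfrac12 \log e, & p = \tfrac12. \end{cases} \tag{391}$$

Theorem 30: If $Q_{\min} > 0$, then

$$\phi(1 - \beta_Q)\, Q_{\min}^2\, \delta^2 \le L_\delta(Q) \tag{392}$$

$$\le \log\bigl(1 + 2 Q_{\min} \delta^2\bigr) \tag{393}$$

where (393) holds if $\delta \le \frac{1 - Q_{\max}}{Q_{\min}}$.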

Proof: The following refinement of Pinsker's inequality (1) was derived in [76, Section 4]:

$$\phi(1 - \beta_Q)\, |P - Q|^2 \le D(P\|Q). \tag{394}$$

Note that if $Q_{\min} > 0$ then $\beta_Q \le 1 - Q_{\min} < 1$, and $\phi(1 - \beta_Q)$ is well defined and finite. If $P \notin \mathcal{T}_\delta(Q)$, the simple bound

$$|P - Q| \ge \delta\, Q_{\min} \tag{395}$$

together with (388) and (394) yields (392).

The upper bound (393) follows from (364) and the fact that if $\delta \le \frac{1 - Q_{\max}}{Q_{\min}}$, then

$$\inf_{P \notin \mathcal{T}_\delta(Q)} |P - Q| = 2 \delta\, Q_{\min}. \tag{396}$$

To verify (396), note that for every $P \notin \mathcal{T}_\delta(Q)$, there exists $a \in \mathcal{A}$ such that $|P(a) - Q(a)| > \delta Q(a)$, which implies that $|P - Q| > 2 \delta Q(a) \ge 2 \delta Q_{\min}$, thereby establishing $\ge$ in (396). To show equality, let $a_0 \in \mathcal{A}$ be such that $Q(a_0) = Q_{\min}$, and let $a_1 \ne a_0$; since by assumption $\delta \le \frac{1 - Q_{\max}}{Q_{\min}}$, we have $Q(a_1) + \delta Q_{\min} \le Q_{\max} + \delta Q_{\min} \le 1$. Let

$$P(a) = \begin{cases} (1 - \delta - \varepsilon)\, Q_{\min}, & a = a_0 \\ Q(a_1) + (\delta + \varepsilon)\, Q_{\min}, & a = a_1 \\ Q(a), & \text{otherwise} \end{cases} \tag{397}$$

for a sufficiently small $\varepsilon > 0$ so that (397) is a probability measure. Then, $P \notin \mathcal{T}_\delta(Q)$ and $|P - Q| = 2 (\delta + \varepsilon)\, Q_{\min}$, which verifies the equality in (396) by letting $\varepsilon \downarrow 0$. ■


Remark 38: If $\delta < \frac{1 - Q_{\max}}{Q_{\min}}$, the ratio between the upper and lower bounds in (393) satisfies

$$\frac{1}{Q_{\min}} \cdot \frac{\log e}{2\,\phi(1-\beta_Q)} \cdot \frac{\log\left(1 + 2Q_{\min}\delta^2\right)}{\tfrac12\, Q_{\min}\delta^2 \log e} \le \frac{4}{Q_{\min}} \tag{398}$$

where (398) follows from the fact that its second and third factors are less than or equal to 1 and 4, respectively. Note that both bounds in (393) scale like $\delta^2$ for $\delta \approx 0$.

VII. The $E_\gamma$ Divergence

A. Basic Properties

Generalizing the total variation distance, the $E_\gamma$ divergence in (66) is an $f$-divergence whose utility in information theory has been exemplified in [18], [68], [69], [70], [79], [80], [81].

In this subsection, we provide some basic properties of the $E_\gamma$ divergence, which are essential to Sections VII-B to VII-D. The reader is referred to [68, Sections 2.B, 2.C] for some additional basic properties of the $E_\gamma$ divergence. We assume throughout this section that $\gamma \ge 1$.


Let $P \ll Q$. The $E_\gamma$ divergence in (66) can be expressed in the form

$$E_\gamma(P\|Q) = \mathbb{P}\left[\imath_{P\|Q}(X) > \log\gamma\right] - \gamma\, \mathbb{P}\left[\imath_{P\|Q}(Y) > \log\gamma\right] \tag{399}$$

$$= \max_{F \in \mathcal{F}} \left(P(F) - \gamma\, Q(F)\right) \tag{400}$$

where $X \sim P$ and $Y \sim Q$, and (400) follows from the Neyman-Pearson lemma.

Although the $E_\gamma$ divergence generalizes the total variation distance, $E_\gamma(P\|Q) = 0$ for $\gamma > 1$ does not imply $P = Q$ since in that case (67) is not strictly convex at $t = 1$ (see Proposition 1). This is illustrated in the following example.

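Example 13: Let $\gamma > 1$, and let $P$ and $Q$ be probability measures defined on $\mathcal{A} = \{0,1\}$:

$$P(0) = \frac{1+\gamma}{2\gamma}, \qquad Q(0) = \frac{1}{\gamma}. \tag{401}$$

Since $\imath_{P\|Q}(x) = \log(1+\gamma)\, 1\{x = 0\} - \log 2 < \log\gamma$ for all $x \in \mathcal{A}$ (indeed, $\frac{P(0)}{Q(0)} = \frac{1+\gamma}{2} < \gamma$ and $\frac{P(1)}{Q(1)} = \frac12$), (399) implies that $E_\gamma(P\|Q) = 0$.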
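The vanishing of $E_\gamma$ in Example 13 can be checked numerically. The following is a minimal Python sketch, assuming the finite-alphabet form $E_\gamma(P\|Q) = \sum_a (P(a) - \gamma Q(a))^+$ implied by $f_\gamma(t) = (t-\gamma)^+$; the helper names `e_gamma`, `example_13` and `total_variation` are illustrative and not part of the paper.

```python
def e_gamma(p, q, gamma):
    """E_gamma(P||Q) on a finite alphabet: sum over a of (P(a) - gamma*Q(a))^+."""
    return sum(max(pa - gamma * qa, 0.0) for pa, qa in zip(p, q))


def example_13(gamma):
    """Probability vectors (P, Q) on {0, 1} with P(0) = (1+gamma)/(2*gamma), Q(0) = 1/gamma."""
    p0 = (1.0 + gamma) / (2.0 * gamma)
    q0 = 1.0 / gamma
    return [p0, 1.0 - p0], [q0, 1.0 - q0]


def total_variation(p, q):
    """|P - Q| = sum over a of |P(a) - Q(a)|, taking values in [0, 2]."""
    return sum(abs(pa - qa) for pa, qa in zip(p, q))


if __name__ == "__main__":
    for gamma in (1.5, 2.0, 10.0):
        p, q = example_13(gamma)
        # E_gamma vanishes although P and Q are different measures.
        print(gamma, e_gamma(p, q, gamma), total_variation(p, q))
```

For every $\gamma > 1$ tried, the sketch reports a zero $E_\gamma$ divergence together with a strictly positive $|P - Q|$, in line with the example.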

The monotonicity of the $E_\gamma$ divergence in $\gamma \in [1,\infty)$ holds since $f_{\gamma_1}(t) \le f_{\gamma_2}(t)$ for all $t > 0$ with $f_\gamma(t) = (t - \gamma)^+$ and $\gamma_1 > \gamma_2 \ge 1$. Therefore,

$$\frac{E_{\gamma_1}(P\|Q)}{E_{\gamma_2}(P\|Q)} \le 1. \tag{402}$$

Although Theorem 1b) does not apply in order to prove that 1 is the best constant in (402), we can verify it by defining $P$ and $Q$ on $\mathcal{A} = \{0,1\}$ with $P(0) = \tfrac12$ and $Q(0) = \varepsilon > 0$. This yields that if $\gamma_1 > \gamma_2 \ge 1$, then for all $\varepsilon \in \left(0, \tfrac{1}{2\gamma_1}\right)$,

$$\frac{E_{\gamma_1}(P\|Q)}{E_{\gamma_2}(P\|Q)} = \frac{1 - 2\varepsilon\gamma_1}{1 - 2\varepsilon\gamma_2}, \tag{403}$$

yielding the optimality of the constant in the right side of (402) by letting $\varepsilon \downarrow 0$ in (403).

From (400), the following inequality holds: If $P \ll R \ll Q$, and $\gamma_1, \gamma_2 \ge 1$ then

$$E_{\gamma_1\gamma_2}(P\|Q) \le E_{\gamma_1}(P\|R) + \gamma_1\, E_{\gamma_2}(R\|Q). \tag{404}$$

Letting $\gamma_1 = 1$ in (404) (see (68)) and $\gamma_2 = \gamma$ yield

$$E_\gamma(P\|Q) - E_\gamma(R\|Q) \le \tfrac12 |P - R|. \tag{405}$$

Generalizing the fact that $E_1(P\|Q) = \tfrac12 |P - Q|$, the following identity is a special case of [47, Corollary 2.3]:

$$\min_{R \colon P, Q \ll R} \left\{E_\gamma(P\|R) + E_\gamma(Q\|R)\right\} = \left(1 - \gamma + \tfrac12 |P - Q|\right)^+ \tag{406}$$

while [..., (21)] states that

$$\min_{R} \left\{E_\gamma(R\|P) + E_\gamma(R\|Q)\right\} \ge \left(1 - \gamma + \tfrac12 |P - Q|\right)^+, \tag{407}$$

which implies, by taking $R = P$,

$$\left(1 - \gamma + \tfrac12 |P - Q|\right)^+ \le E_\gamma(P\|Q). \tag{408}$$

We end this subsection with the following result. 

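Theorem 31: If $P \ll Q$ and $Y \sim Q$, then

$$\mathbb{E}\left[\left|\exp\left(\imath_{P\|Q}(Y)\right) - \gamma\right|\right] = 2E_\gamma(P\|Q) + \gamma - 1, \tag{409}$$

$$\mathbb{E}\left[\max\left\{\gamma, \exp\left(\imath_{P\|Q}(Y)\right)\right\}\right] = \gamma + E_\gamma(P\|Q), \tag{410}$$

$$\mathbb{E}\left[\min\left\{\gamma, \exp\left(\imath_{P\|Q}(Y)\right)\right\}\right] = 1 - E_\gamma(P\|Q). \tag{411}$$

Proof: The identity $|z| = 2(z)^+ - z$, for all $z \in \mathbb{R}$, and (66) are used to prove (409):

$$\mathbb{E}\left[\left|\exp\left(\imath_{P\|Q}(Y)\right) - \gamma\right|\right] = 2\,\mathbb{E}\left[\left(\exp\left(\imath_{P\|Q}(Y)\right) - \gamma\right)^+\right] - \mathbb{E}\left[\exp\left(\imath_{P\|Q}(Y)\right) - \gamma\right] \tag{412}$$

$$= 2E_\gamma(P\|Q) + \gamma - 1. \tag{413}$$

Eqs. (410) and (411) follow from (409), and the identities

$$\max\{x_1, x_2\} = \tfrac12\left[x_1 + x_2 + |x_1 - x_2|\right], \tag{414}$$

$$\min\{x_1, x_2\} = \tfrac12\left[x_1 + x_2 - |x_1 - x_2|\right] \tag{415}$$

for all $x_1, x_2 \in \mathbb{R}$. ■

Remark 39: In view of (68), it follows that (409) and (410) are specialized respectively to (243) and [49, (20)] by letting $\gamma = 1$.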
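The three identities of Theorem 31 are easy to test on a finite alphabet. The sketch below is only an illustration under that assumption: expectations with respect to $Y \sim Q$ become finite sums, $\exp(\imath_{P\|Q}(a)) = P(a)/Q(a)$, and the helper names (`random_pmf`, `theorem_31_residuals`) are hypothetical.

```python
import random


def e_gamma(p, q, gamma):
    """E_gamma(P||Q) = sum over a of (P(a) - gamma*Q(a))^+ (finite alphabet)."""
    return sum(max(pa - gamma * qa, 0.0) for pa, qa in zip(p, q))


def random_pmf(n, rng):
    """A strictly positive probability vector of length n."""
    w = [rng.random() + 0.05 for _ in range(n)]
    s = sum(w)
    return [x / s for x in w]


def theorem_31_residuals(p, q, gamma):
    """Left side minus right side of (409), (410) and (411), with Y ~ Q."""
    ratio = [pa / qa for pa, qa in zip(p, q)]  # exp(i_{P||Q}(a)) = P(a)/Q(a)
    e = e_gamma(p, q, gamma)
    mean_abs = sum(qa * abs(z - gamma) for qa, z in zip(q, ratio))
    mean_max = sum(qa * max(gamma, z) for qa, z in zip(q, ratio))
    mean_min = sum(qa * min(gamma, z) for qa, z in zip(q, ratio))
    return (mean_abs - (2.0 * e + gamma - 1.0),
            mean_max - (gamma + e),
            mean_min - (1.0 - e))


if __name__ == "__main__":
    rng = random.Random(0)
    p, q = random_pmf(6, rng), random_pmf(6, rng)
    for gamma in (1.0, 1.3, 2.0):
        print(gamma, theorem_31_residuals(p, q, gamma))
```

All residuals should vanish up to floating-point rounding.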
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B. An Integral Representation of f -divergences 
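In this subsection we show that

$$\left\{\left(E_\gamma(P\|Q), E_\gamma(Q\|P)\right),\ \gamma \ge 1\right\}$$

uniquely determines $D(P\|Q)$, $\mathscr{H}_\alpha(P\|Q)$, as well as any other $f$-divergence with twice differentiable $f$.

Proposition 3: Let $P \ll\gg Q$, and let $f \colon (0,\infty) \to \mathbb{R}$ be convex and twice differentiable with $f(1) = 0$. Then,

$$D_f(P\|Q) = \int_1^\infty \left(f''(\gamma)\, E_\gamma(P\|Q) + \gamma^{-3} f''(\gamma^{-1})\, E_\gamma(Q\|P)\right) \mathrm{d}\gamma. \tag{416}$$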

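Proof: From [66, Theorem 11], if $f \colon (0,\infty) \to \mathbb{R}$ is a convex function with $f(1) = 0$, then$^{4}$

$$D_f(P\|Q) = \int_0^1 \mathcal{I}_p(P\|Q)\, \mathrm{d}\Gamma_f(p) \tag{417}$$

where $\Gamma_f$ is the $\sigma$-finite measure defined on Borel subsets of $(0,1)$ by

$$\Gamma_f\left((p_1, p_2]\right) = \int_{p_1}^{p_2} \frac{1}{p}\, \mathrm{d}g_f(p) \tag{418}$$

for the non-decreasing function

$$g_f(p) = -f'_+\left(\frac{1-p}{p}\right), \quad p \in (0,1) \tag{419}$$

where $f'_+$ denotes the right derivative of $f$.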

The DeGroot statistical information in ( [69] ) has the following operational role OHl . which is used in this 
proof. Assume hypotheses $H_0$ and $H_1$ have a-priori probabilities $p$ and $1 - p$, respectively, and let $P$ and $Q$ be the conditional probability measures of an observation $Y$ given $H_0$ or $H_1$. Then, $\mathcal{I}_p(P\|Q)$ is equal to the difference between the minimum error probabilities when the most likely a-priori hypothesis is selected, and when the most likely a posteriori hypothesis is selected. This measure therefore quantifies the value of the observations for the task of discriminating between the hypotheses. From the operational role of this measure, it follows that if $P \ll\gg Q$

$$\mathcal{I}_p(P\|Q) = \mathcal{I}_{1-p}(Q\|P). \tag{420}$$

The $E_\gamma$ divergence and DeGroot statistical information are related by

$$\mathcal{I}_p(P\|Q) = \begin{cases} p\, E_{\frac{1-p}{p}}(P\|Q), & p \in (0, \tfrac12] \\[1mm] (1-p)\, E_{\frac{p}{1-p}}(Q\|P), & p \in [\tfrac12, 1). \end{cases} \tag{421}$$

The expression for $p \in (0, \tfrac12]$ follows from the fact that the functions that yield $E_\gamma$ and $\mathcal{I}_p$ in (67) and (70), respectively, satisfy

$$\phi_p = p\, f_{\frac{1-p}{p}}. \tag{422}$$

The remainder of (421) follows in view of (420).

$^{4}$ See also [75, Theorem 1] for an earlier representation of $f$-divergence as an averaged DeGroot statistical information.

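Specializing (417) to a twice differentiable $f$ gives

$$D_f(P\|Q) = \int_0^1 \mathcal{I}_p(P\|Q)\, \frac{1}{p^3}\, f''\left(\frac{1-p}{p}\right) \mathrm{d}p \tag{423}$$

$$= \int_0^{\frac12} \mathcal{I}_p(P\|Q)\, \frac{1}{p^3}\, f''\left(\frac{1-p}{p}\right) \mathrm{d}p + \int_{\frac12}^1 \mathcal{I}_p(P\|Q)\, \frac{1}{p^3}\, f''\left(\frac{1-p}{p}\right) \mathrm{d}p \tag{424}$$

$$= \int_0^{\frac12} E_{\frac{1-p}{p}}(P\|Q)\, \frac{1}{p^2}\, f''\left(\frac{1-p}{p}\right) \mathrm{d}p + \int_{\frac12}^1 E_{\frac{p}{1-p}}(Q\|P)\, \frac{1-p}{p^3}\, f''\left(\frac{1-p}{p}\right) \mathrm{d}p \tag{425}$$

$$= \int_1^\infty E_\gamma(P\|Q)\, f''(\gamma)\, \mathrm{d}\gamma + \int_0^1 E_{\frac1\gamma}(Q\|P)\, \gamma\, f''(\gamma)\, \mathrm{d}\gamma \tag{426}$$

$$= \int_1^\infty \left(f''(\gamma)\, E_\gamma(P\|Q) + \gamma^{-3} f''(\gamma^{-1})\, E_\gamma(Q\|P)\right) \mathrm{d}\gamma \tag{427}$$

where (423) follows from (417)–(419); (424) follows by splitting the interval of integration into two parts; (425) follows from (421); (426) follows by the substitution $\gamma = \frac{1-p}{p}$, and (427) follows by changing the variable of integration $t = \frac1\gamma$ in the second integral in (426). ■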

Particularizing Proposition 3 to the most salient $f$-divergences we obtain (cf. [66, (84)–(86)] for alternative integral representations as a function of DeGroot statistical information)

$$D(P\|Q) = \log e \int_1^\infty \left(\gamma^{-1} E_\gamma(P\|Q) + \gamma^{-2} E_\gamma(Q\|P)\right) \mathrm{d}\gamma, \tag{428}$$

$$\mathscr{H}_\alpha(P\|Q) = \alpha \int_1^\infty \left(\gamma^{\alpha-2} E_\gamma(P\|Q) + \gamma^{-\alpha-1} E_\gamma(Q\|P)\right) \mathrm{d}\gamma, \tag{429}$$

and specializing (429) to $\alpha = 2$ yields

$$\chi^2(P\|Q) = 2 \int_1^\infty \left(E_\gamma(P\|Q) + \gamma^{-3} E_\gamma(Q\|P)\right) \mathrm{d}\gamma. \tag{430}$$

Accordingly, bounds on the $E_\gamma$ divergence, such as those presented in Section VII-C, directly translate into bounds on other important $f$-divergences.
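The integral representations of the relative entropy and the $\chi^2$ divergence in terms of $E_\gamma$ can be sanity-checked numerically. The sketch below is a hedged illustration, not the authors' code: it assumes a finite alphabet with strictly positive masses, natural logarithms ($\log e = 1$), and a plain midpoint rule; the integration can stop at the largest likelihood ratio because both $E_\gamma(P\|Q)$ and $E_\gamma(Q\|P)$ vanish beyond it. All function names are illustrative.

```python
import math


def e_gamma(p, q, gamma):
    """E_gamma(P||Q) = sum over a of (P(a) - gamma*Q(a))^+ on a finite alphabet."""
    return sum(max(pa - gamma * qa, 0.0) for pa, qa in zip(p, q))


def midpoint(f, a, b, n=20000):
    """Composite midpoint rule for the integral of f over [a, b]."""
    h = (b - a) / n
    return h * sum(f(a + (k + 0.5) * h) for k in range(n))


def upper_limit(p, q):
    """Both E_gamma(P||Q) and E_gamma(Q||P) are zero for gamma above this value."""
    return max(max(pa / qa, qa / pa) for pa, qa in zip(p, q))


def kl_from_e_gamma(p, q):
    """D(P||Q) in nats: integral over g >= 1 of E_g(P||Q)/g + E_g(Q||P)/g^2."""
    g_max = upper_limit(p, q)
    return midpoint(lambda g: e_gamma(p, q, g) / g + e_gamma(q, p, g) / g ** 2,
                    1.0, g_max)


def chi2_from_e_gamma(p, q):
    """chi^2(P||Q): twice the integral over g >= 1 of E_g(P||Q) + E_g(Q||P)/g^3."""
    g_max = upper_limit(p, q)
    return 2.0 * midpoint(lambda g: e_gamma(p, q, g) + e_gamma(q, p, g) / g ** 3,
                          1.0, g_max)


if __name__ == "__main__":
    P, Q = [0.5, 0.3, 0.2], [0.2, 0.3, 0.5]
    kl = sum(pa * math.log(pa / qa) for pa, qa in zip(P, Q))
    chi2 = sum((pa - qa) ** 2 / qa for pa, qa in zip(P, Q))
    print(kl, kl_from_e_gamma(P, Q))      # direct value vs. integral representation
    print(chi2, chi2_from_e_gamma(P, Q))
```

For this pair the direct values are $0.3 \ln 2.5$ nats and $0.63$, and the numerical integrals agree with them to within the quadrature error.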

Remark 40: Proposition 3 can be derived also from the integral representation of $f$-divergences in [18, Corollary 3.7].

C. Extension of Pinsker’s Inequality to $E_\gamma$ Divergence

This subsection upper bounds the $E_\gamma$ divergence in terms of the relative entropy.

Theorem 32:

$$E_\gamma(P\|Q)\, \log\gamma \le D(P\|Q) + 2e^{-1} \log e, \tag{431}$$

$$2E_\gamma^2(P\|Q)\, \log e \le D(P\|Q). \tag{432}$$

Proof: The bound in (431) appears in [68, Proposition 13]. For $\gamma = 1$, (432) reduces to (1). Since $E_\gamma(P\|Q)$ is monotonically decreasing in $\gamma$, (432) also holds for $\gamma > 1$. ■

For $\gamma = 1$, (432) becomes Pinsker’s inequality (1), for which there is no tighter constant. Moreover, in view of (68), for small $E_1(P\|Q)$, the minimum achievable $D(P\|Q)$ is indeed quadratic in $E_1(P\|Q)$ [39]. This ceases to be the case for $\gamma > 1$, in which case it is possible to upper bound $E_\gamma(P\|Q)$ as a constant times $D(P\|Q)$.
Theorem 33: For every $\gamma > 1$,

$$\sup \frac{E_\gamma(P\|Q)}{D(P\|Q)} = c_\gamma \tag{433}$$

where the supremum is over $P \ll Q$, $P \ne Q$, and $c_\gamma$ is a universal function (independent of $P$ and $Q$), given by

$$c_\gamma = \frac{t_\gamma - \gamma}{r(t_\gamma)}, \tag{434}$$

$$t_\gamma = -\gamma\, W_{-1}\left(-\tfrac{1}{\gamma}\, e^{-\frac{1}{\gamma}}\right) \tag{435}$$

where $r$ in (434) is given in (43), and $W_{-1}$ in (435) denotes the secondary real branch of the Lambert $W$ function

m. 

Proof: The functions $f_\gamma(t) = (t - \gamma)^+$ and $r$ (see (43)) satisfy the sufficient conditions of Theorem 1. Their ratio is

$$\kappa_\gamma(t) = \begin{cases} \dfrac{t - \gamma}{r(t)}, & t \in (\gamma, \infty) \\[1mm] 0, & t \in (0, \gamma]. \end{cases} \tag{436}$$

For $t > \gamma$,

$$\kappa'_\gamma(t) = \frac{\gamma \log t + (1 - t) \log e}{r^2(t)}. \tag{437}$$

Since $\gamma > 1$, it follows from (437) that there exists $t_\gamma \in (\gamma, \infty)$ such that $\kappa_\gamma$ is monotonically increasing on $[\gamma, t_\gamma]$, and it is monotonically decreasing on $[t_\gamma, \infty)$. The value $t_\gamma$ is the unique solution of the equation $\kappa'_\gamma(t) = 0$ in $(\gamma, \infty)$. From (437), $t_\gamma \in (\gamma, \infty)$ solves the equation

$$\gamma \log t = (t - 1) \log e \tag{438}$$

which, after exponentiating both sides of (438) and making the substitution $x = -\frac{t}{\gamma}$, gives

$$x\, e^x = -\tfrac{1}{\gamma}\, e^{-\frac{1}{\gamma}}. \tag{439}$$

The trivial solution of (439), $x = -\frac{1}{\gamma}$, corresponds to $t = 1$, which is an improper solution of (438) since $t = 1 < \gamma$. The proper solution of (439) is its second real solution given by

$$x = W_{-1}\left(-\tfrac{1}{\gamma}\, e^{-\frac{1}{\gamma}}\right); \tag{440}$$

consequently, $t = -\gamma x$ and (440) give (435). In conclusion, for $t > 0$ and $\gamma > 1$,

$$0 \le \kappa_\gamma(t) \le \kappa_\gamma(t_\gamma) = c_\gamma \tag{441}$$

where the equality in (441) follows from (434), (436), and $t_\gamma > \gamma$. Theorem 1a) yields

$$E_\gamma(P\|Q) \le c_\gamma\, D(P\|Q). \tag{442}$$


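To show (433), or in other words that there is no better constant in (442) than $c_\gamma$, it is enough to restrict to binary alphabets: Let $\mathcal{A} = \{0,1\}$, $\varepsilon \in (0,1)$, and $P_\varepsilon(0) = \varepsilon$, $Q_\varepsilon(0) = \frac{\varepsilon}{t_\gamma}$. Since $t_\gamma > 1$, we have

$$\frac{P_\varepsilon(1)}{Q_\varepsilon(1)} = \frac{1 - \varepsilon}{1 - \frac{\varepsilon}{t_\gamma}} < 1 < \gamma \tag{443}$$

and

$$\kappa_\gamma\left(\frac{P_\varepsilon(1)}{Q_\varepsilon(1)}\right) = 0 \tag{444}$$

where (444) follows from (436) and (443). We show in the following that $\frac{E_\gamma(P_\varepsilon\|Q_\varepsilon)}{D(P_\varepsilon\|Q_\varepsilon)}$ can come arbitrarily close (from below) to $c_\gamma$ by choosing a sufficiently small $\varepsilon > 0$. To that end, for all $\varepsilon \in (0,1)$,

$$E_\gamma(P_\varepsilon\|Q_\varepsilon) = Q_\varepsilon(0)\, f_\gamma\left(\frac{P_\varepsilon(0)}{Q_\varepsilon(0)}\right) + Q_\varepsilon(1)\, f_\gamma\left(\frac{P_\varepsilon(1)}{Q_\varepsilon(1)}\right) \tag{445}$$

$$= Q_\varepsilon(0)\, r\left(\frac{P_\varepsilon(0)}{Q_\varepsilon(0)}\right) \kappa_\gamma\left(\frac{P_\varepsilon(0)}{Q_\varepsilon(0)}\right) + Q_\varepsilon(1)\, r\left(\frac{P_\varepsilon(1)}{Q_\varepsilon(1)}\right) \kappa_\gamma\left(\frac{P_\varepsilon(1)}{Q_\varepsilon(1)}\right) \tag{446}$$

$$= Q_\varepsilon(0)\, r\left(\frac{P_\varepsilon(0)}{Q_\varepsilon(0)}\right) \kappa_\gamma\left(\frac{P_\varepsilon(0)}{Q_\varepsilon(0)}\right) \tag{447}$$

$$= c_\gamma\, Q_\varepsilon(0)\, r\left(\frac{P_\varepsilon(0)}{Q_\varepsilon(0)}\right) \tag{448}$$

$$= c_\gamma \left[D(P_\varepsilon\|Q_\varepsilon) - Q_\varepsilon(1)\, r\left(\frac{P_\varepsilon(1)}{Q_\varepsilon(1)}\right)\right] \tag{449}$$

where (445) holds due to (66); (446) follows from the definition of $\kappa_\gamma$ as the continuous extension of $\frac{f_\gamma}{r}$ with $r$ in (43); (447) holds due to (444); (448) follows from (434), (436) and $\frac{P_\varepsilon(0)}{Q_\varepsilon(0)} = t_\gamma$; (449) follows from (42) which implies that $D_r(P_\varepsilon\|Q_\varepsilon) = D(P_\varepsilon\|Q_\varepsilon)$. From (442) and (449),

$$c_\gamma \left(1 - \frac{Q_\varepsilon(1)}{D(P_\varepsilon\|Q_\varepsilon)}\, r\left(\frac{P_\varepsilon(1)}{Q_\varepsilon(1)}\right)\right) \le \frac{E_\gamma(P_\varepsilon\|Q_\varepsilon)}{D(P_\varepsilon\|Q_\varepsilon)} \le c_\gamma. \tag{450}$$

Appendix H shows that

$$\lim_{\varepsilon \to 0} \frac{Q_\varepsilon(1)}{D(P_\varepsilon\|Q_\varepsilon)}\, r\left(\frac{P_\varepsilon(1)}{Q_\varepsilon(1)}\right) = 0, \tag{451}$$

which implies from (450) that

$$\lim_{\varepsilon \to 0} \frac{E_\gamma(P_\varepsilon\|Q_\varepsilon)}{D(P_\varepsilon\|Q_\varepsilon)} = c_\gamma. \tag{452}$$

■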


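Remark 41: The value of $c_\gamma$ given in (434) can be approximated by

$$\tilde{c}_\gamma = \frac{\delta}{(\delta + \gamma) \log\left(\frac{\delta + \gamma}{e}\right) + \log e}, \tag{453}$$

$$\delta = \frac{a\, \gamma \log\gamma}{\log e}, \qquad a = 1.1791 \tag{454}$$

with a relative error of less than 1% for all $\gamma > 1$, and no more than about $10^{-3}$ for $\gamma \ge 2$.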


Fig. 3. The coefficient $c_\gamma$ in (434) (solid line) compared to $\frac{1}{\log\gamma}$ (cf. (431)) (dashed line).

It can be verified that the bound in Theorem 33 is tighter than (431) since $c_\gamma < \frac{1}{\log\gamma}$ for $\gamma > 1$, and the additional positive summand in the right side of (431) further loosens the bound (431) in comparison to (433). According to the approximation of $c_\gamma$ in (453) and (454), we have for large values of $\gamma$

$$c_\gamma \approx \frac{1}{\log\left(\frac{a\, \gamma \log\gamma}{e \log e}\right)}. \tag{455}$$


Remark 42: The impossibility of a general lower bound on $E_\gamma(P\|Q)$, for $\gamma > 1$, in terms of the relative entropy $D(P\|Q)$ is evident from Example 13.

Remark 43: The fact that $\{c_\gamma\}_{\gamma > 1}$ in (434) is monotonically decreasing in $\gamma$ (see Figure 3) is consistent with (433) and the fact that the $E_\gamma$ divergence is monotonically decreasing in $\gamma$.

Remark 44: The fact that the behavior of $D(P\|Q)$ for small $|P - Q|$ is quadratic rather than linear does not contradict Theorem 33 because $\lim_{\gamma \downarrow 1} c_\gamma = +\infty$ (see Figure 3).

In view of (400) and (433) we obtain

Corollary 4: If $P \ll Q$, $\gamma > 1$ and $F \in \mathcal{F}$, then

$$P(F) \le \gamma\, Q(F) + c_\gamma\, D(P\|Q). \tag{456}$$

Corollary 5: If $P \ll Q$ and $\gamma > 1$, then

$$E_\gamma(P\|Q) \le \min_{\lambda \in [0,1]} \left\{\tfrac{\lambda}{2}\, |P - Q| + c_\gamma\, D\left((1 - \lambda)P + \lambda Q \,\|\, Q\right)\right\}. \tag{457}$$

Proof: For $\lambda \in [0,1]$, let $R = (1 - \lambda)P + \lambda Q$. Then, we have for $\gamma > 1$,

$$E_\gamma(P\|Q) \le \tfrac12 |P - R| + E_\gamma(R\|Q) \tag{458}$$

$$= \tfrac{\lambda}{2}\, |P - Q| + E_\gamma\left((1 - \lambda)P + \lambda Q \,\|\, Q\right) \tag{459}$$

$$\le \tfrac{\lambda}{2}\, |P - Q| + c_\gamma\, D\left((1 - \lambda)P + \lambda Q \,\|\, Q\right) \tag{460}$$

where (458) is (405); and (460) follows from (433). ■

Remark 45: Note that the upper bounds in the right sides of (408) and (442) follow from (457) by setting $\lambda = 1$ or $\lambda = 0$, respectively.

Remark 46: Further upper bounding the right side of (460) by invoking Pinsker’s inequality and the convexity of relative entropy, followed by an optimization over the free parameter $\lambda \in [0,1]$, does not lead to an improvement beyond the minimum of the bounds in (432) and (442).


D. Lower Bound on $\mathbb{F}_{P\|Q}$ as a Function of $D(P\|Q)$

The $E_\gamma$ divergence proves to be instrumental in the proof of the following bound on the complementary relative information spectrum for positive arguments.

Theorem 34: If $P \ll Q$, $P \ne Q$, and $\beta > 1$, then

$$\frac{1 - \mathbb{F}_{P\|Q}(\log\beta)}{D(P\|Q)} \le u(\beta) = \min_{\gamma \in (1,\beta)} \left(\frac{\beta\, c_\gamma}{\beta - \gamma}\right), \tag{461}$$

where $c_\gamma$ is given in (434). Furthermore, the function $u \colon (1,\infty) \to \mathbb{R}^+$ is monotonically decreasing with

$$u(\beta) \le \frac{1}{\log\left(\frac{\beta}{2e}\right)}, \quad \forall\, \beta > 2e. \tag{462}$$

Proof: For $\beta > 1$, denote the event

$$F_\beta = \left\{x \in \mathcal{A} \colon \imath_{P\|Q}(x) > \log\beta\right\} \tag{463}$$

which satisfies

$$P(F_\beta) > \beta\, Q(F_\beta). \tag{464}$$

Then,

$$1 - \mathbb{F}_{P\|Q}(\log\beta) = P(F_\beta) \tag{465}$$

$$\le \inf_{\gamma \in (1,\beta)} \frac{P(F_\beta) - \gamma\, Q(F_\beta)}{1 - \frac{\gamma}{\beta}} \tag{466}$$

$$\le \inf_{\gamma \in (1,\beta)} \frac{\beta\, E_\gamma(P\|Q)}{\beta - \gamma} \tag{467}$$

$$\le \inf_{\gamma \in (1,\beta)} \left(\frac{\beta\, c_\gamma}{\beta - \gamma}\right) D(P\|Q) \tag{468}$$

where (465) holds by Definition 2; (466) follows from (464); (467) is satisfied by (400), and (468) is due to (433). Note that the infimum above is attained because $c_\gamma$ is continuous and, for $\beta > 1$, $\frac{\beta\, c_\gamma}{\beta - \gamma}$ tends to $+\infty$ at both extremes of the interval $(1, \beta)$. The monotonicity of $u(\beta)$ and the bound in (462) are proved in Appendix I.


VIII. Rényi Divergence

The Rényi divergence (Definition 4) admits a variational representation in terms of the relative entropy [94, Theorem 1]. Let $P_1 \ll P_0$; then, for $\alpha > 0$,

$$(1 - \alpha)\, D_\alpha(P_1\|P_0) = \min_{P \ll P_1} \left\{\alpha\, D(P\|P_1) + (1 - \alpha)\, D(P\|P_0)\right\}. \tag{469}$$

In this section, integral expressions for the Rényi divergence are derived in terms of the relative information spectrum (Definition 2). These expressions are used to obtain bounds on the Rényi divergence as a function of the variational distance under the assumption of bounded relative information.
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A. Expressions in Terms of the Relative Information Spectrum 

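To state the results in this section, it is convenient to introduce $\zeta_\alpha \colon (0, \infty) \to [0, \infty)$:

$$\zeta_\alpha(\beta) = \beta^{\alpha - 2} \left(1 - \mathbb{F}_{P\|Q}(\log\beta)\right). \tag{470}$$

The Rényi divergence admits the following representation in terms of the relative information spectrum and the relative information bounds $(\beta_1, \beta_2) \in [0,1]^2$ in (149)–(150).

Theorem 35: Let $P \ll Q$.

• If $\beta_1 > 0$ and $\alpha \in (0,1) \cup (1,\infty)$, then

$$D_\alpha(P\|Q) = \frac{1}{\alpha - 1} \log\left(\beta_1^{1-\alpha} + (1 - \alpha) \int_{\beta_2}^{\beta_1^{-1}} \left(\beta^{\alpha-2} - \zeta_\alpha(\beta)\right) \mathrm{d}\beta\right). \tag{471}$$

• If $\beta_1 = 0$ and $\alpha \in (0,1)$, then

$$D_\alpha(P\|Q) = \frac{1}{\alpha - 1} \log\left((1 - \alpha) \int_{\beta_2}^{\infty} \left(\beta^{\alpha-2} - \zeta_\alpha(\beta)\right) \mathrm{d}\beta\right). \tag{472}$$

• If $\alpha \in (1, \infty)$, then

$$D_\alpha(P\|Q) = \frac{1}{\alpha - 1} \log\left((\alpha - 1) \int_0^\infty \zeta_\alpha(\beta)\, \mathrm{d}\beta\right) \tag{473}$$

$$= \frac{1}{\alpha - 1} \log\left(\beta_2^{\alpha-1} + (\alpha - 1) \int_{\beta_2}^{\infty} \zeta_\alpha(\beta)\, \mathrm{d}\beta\right). \tag{474}$$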

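Proof: If $\alpha > 1$, the identity

$$D_\alpha(P\|Q) = \frac{1}{\alpha - 1} \log\left(\mathbb{E}\left[\exp\left((\alpha - 1)\, \imath_{P\|Q}(X)\right)\right]\right)$$

with $X \sim P$ implies that $D_\alpha(P\|Q)$ is given by

$$\frac{1}{\alpha - 1} \log\left(\int_0^\infty \mathbb{P}\left[\exp\left((\alpha - 1)\, \imath_{P\|Q}(X)\right) > t\right] \mathrm{d}t\right) \tag{475}$$

$$= \frac{1}{\alpha - 1} \log\left(\int_0^\infty \mathbb{P}\left[\imath_{P\|Q}(X) > \frac{\log t}{\alpha - 1}\right] \mathrm{d}t\right) \tag{476}$$

where (475) follows from (258) for an arbitrary non-negative random variable $V$, and we use $\alpha > 1$ to write (476). Then, (473) holds by the definition of the relative information spectrum in (27) and by changing the integration variable $t = \beta^{\alpha-1}$. If $\beta_1 > 0$, the integrand in the right side of (473) is zero in $[\beta_1^{-1}, \infty)$ and the expression in (471) is readily verified (for $\alpha > 1$). More generally (without requiring $\beta_1 > 0$), we split the integral in the right side of (473) into $[0, \beta_2) \cup [\beta_2, \infty)$, and (474) follows since $(\alpha - 1)$ times the integral over the leftmost interval is $\beta_2^{\alpha-1}$, considering that $\mathbb{F}_{P\|Q}(\log\beta) = 0$ therein.

If $\alpha \in (0,1)$, we write $D_\alpha(P\|Q)$ as

$$\frac{1}{\alpha - 1} \log\left(\mathbb{E}\left[\exp\left((\alpha - 1)\, \imath_{P\|Q}(X)\right)\right]\right)$$

$$= \frac{1}{\alpha - 1} \log\left(\int_0^\infty \mathbb{P}\left[\exp\left((\alpha - 1)\, \imath_{P\|Q}(X)\right) \ge t\right] \mathrm{d}t\right) \tag{477}$$

$$= \frac{1}{\alpha - 1} \log\left(\int_0^\infty \mathbb{P}\left[\imath_{P\|Q}(X) \le \frac{\log t}{\alpha - 1}\right] \mathrm{d}t\right) \tag{478}$$

$$= \frac{1}{\alpha - 1} \log\left(\int_\infty^0 \mathbb{P}\left[\imath_{P\|Q}(X) \le \log\beta\right] (\alpha - 1)\, \beta^{\alpha-2}\, \mathrm{d}\beta\right)$$

$$= \frac{1}{\alpha - 1} \log\left((1 - \alpha) \int_0^\infty \beta^{\alpha-2}\, \mathbb{F}_{P\|Q}(\log\beta)\, \mathrm{d}\beta\right) \tag{479}$$

which is the expression in (472). If $\beta_1 > 0$, then we can further split the integral in the right side of (479) into the intervals $[\beta_2, \beta_1^{-1}) \cup [\beta_1^{-1}, \infty)$. Over the rightmost interval, $\mathbb{F}_{P\|Q}(\log\beta) = 1$ and $(1 - \alpha)$ times the integral is seen to be $\beta_1^{1-\alpha}$, thereby verifying (471) for $\alpha \in (0,1)$. ■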
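For a finite alphabet, the complementary relative information spectrum $1 - \mathbb{F}_{P\|Q}(\log\beta) = \mathbb{P}[\frac{\mathrm{d}P}{\mathrm{d}Q}(X) > \beta]$ is a step function of $\beta$, so the integral of $\zeta_\alpha(\beta) = \beta^{\alpha-2}(1 - \mathbb{F}_{P\|Q}(\log\beta))$ can be evaluated piece by piece. The sketch below is an illustration for $\alpha > 1$ only, under the assumptions of strictly positive $Q$ and natural logarithms; the function names are hypothetical.

```python
import math


def renyi_direct(p, q, alpha):
    """D_alpha(P||Q) in nats from the definition (finite alphabet, alpha != 1)."""
    s = sum(pa ** alpha * qa ** (1.0 - alpha) for pa, qa in zip(p, q) if pa > 0.0)
    return math.log(s) / (alpha - 1.0)


def renyi_via_spectrum(p, q, alpha):
    """D_alpha(P||Q) for alpha > 1 via the integral of zeta_alpha over (0, inf)."""
    ratios = [pa / qa for pa, qa in zip(p, q)]
    breakpoints = sorted(set(ratios))
    integral, left = 0.0, 0.0
    for right in breakpoints:
        # For beta in [left, right), P[dP/dQ(X) > beta] is constant and equals
        # the P-mass of the letters whose likelihood ratio exceeds `left`.
        tail = sum(pa for pa, z in zip(p, ratios) if z > left)
        # Exact integral of beta^(alpha-2) over [left, right].
        integral += tail * (right ** (alpha - 1.0) - left ** (alpha - 1.0)) / (alpha - 1.0)
        left = right
    return math.log((alpha - 1.0) * integral) / (alpha - 1.0)


if __name__ == "__main__":
    P, Q = [0.5, 0.3, 0.2], [0.2, 0.3, 0.5]
    for alpha in (1.5, 2.0, 3.5):
        print(alpha, renyi_direct(P, Q, alpha), renyi_via_spectrum(P, Q, alpha))
```

For $\alpha = 2$ the integral equals $1 + \chi^2(P\|Q) = 1.63$ for this pair, so both routines return $\ln 1.63$ nats.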


The close relationship between the Rényi and Hellinger divergences in (80) results in the following integral representations for the Hellinger divergence.

Corollary 6: Let $P \ll Q$.

• If $\beta_1 > 0$ and $\alpha \in (0,1) \cup (1,\infty)$, then

$$\mathscr{H}_\alpha(P\|Q) = \frac{\beta_1^{1-\alpha} - 1}{\alpha - 1} - \int_{\beta_2}^{\beta_1^{-1}} \left(\beta^{\alpha-2} - \zeta_\alpha(\beta)\right) \mathrm{d}\beta. \tag{480}$$

• If $\beta_1 = 0$ and $\alpha \in (0,1)$, then

$$\mathscr{H}_\alpha(P\|Q) = \frac{1}{1 - \alpha} - \int_{\beta_2}^{\infty} \left(\beta^{\alpha-2} - \zeta_\alpha(\beta)\right) \mathrm{d}\beta. \tag{481}$$

• If $\alpha \in (1, \infty)$, then

$$\mathscr{H}_\alpha(P\|Q) = \int_0^\infty \zeta_\alpha(\beta)\, \mathrm{d}\beta - \frac{1}{\alpha - 1} \tag{482}$$

$$= \frac{\beta_2^{\alpha-1} - 1}{\alpha - 1} + \int_{\beta_2}^{\infty} \zeta_\alpha(\beta)\, \mathrm{d}\beta. \tag{483}$$

Proof: Combining (80) with (471), (472), (473), (474) yields (480)–(483), respectively. ■

Particularizing (470), (473) and (482) to $\alpha = 2$, we obtain

$$D_2(P\|Q) = \log\left(\int_0^\infty \left(1 - \mathbb{F}_{P\|Q}(\log\beta)\right) \mathrm{d}\beta\right), \tag{484}$$

$$\chi^2(P\|Q) = \int_0^\infty \left(1 - \mathbb{F}_{P\|Q}(\log\beta)\right) \mathrm{d}\beta - 1 \tag{485}$$

$$= \int_1^\infty \left(1 - \mathbb{F}_{P\|Q}(\log\beta)\right) \mathrm{d}\beta - \int_0^1 \mathbb{F}_{P\|Q}(\log\beta)\, \mathrm{d}\beta. \tag{486}$$

Note the resemblance of the integral expressions in (256) and (486) for $D(P\|Q)$ and $\chi^2(P\|Q)$, respectively.

We conclude this subsection by proving three properties of the Hellinger divergence as a function of its order. The first two monotonicity properties are analogous to [37, Theorems 3 and 16] for the Rényi divergence; these monotonicity properties have been originally stated in [65, Proposition 2.7], though the following alternative proof is more transparent.

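Theorem 36: The Hellinger divergence satisfies the following properties:

a) $\mathscr{H}_\alpha(P\|Q)$ is monotonically increasing in $\alpha \in (0, \infty)$;

b) $\left(\frac{1}{\alpha} - 1\right) \mathscr{H}_\alpha(P\|Q)$ is monotonically decreasing in $\alpha \in (0,1)$;

c) $\frac{1}{\alpha}\, \mathscr{H}_\alpha(P\|Q)$ is log-convex in $\alpha \in (0, \infty)$, which implies that for every $\alpha, \beta > 0$

$$\mathscr{H}^2_{\frac{\alpha+\beta}{2}}(P\|Q) \le \frac{(\alpha + \beta)^2}{4\alpha\beta}\, \mathscr{H}_\alpha(P\|Q)\, \mathscr{H}_\beta(P\|Q). \tag{487}$$

Proof:

a) From (39) and (52), we have

$$\mathscr{H}_\alpha(P\|Q) = D_{f_\alpha}(P\|Q) \tag{488}$$

with

$$f_\alpha(t) = \frac{t^\alpha - \alpha(t - 1) - 1}{\alpha - 1}, \quad t > 0, \tag{489}$$

whose derivative is

$$\frac{\partial}{\partial\alpha} f_\alpha(t) = \frac{t\, r(t^{\alpha-1})}{(\alpha - 1)^2 \log e} \ge 0 \tag{490}$$

where the function $r \colon (0,\infty) \to \mathbb{R}$ is defined in (43). Since it is strictly positive except at $t = 1$, $f_\alpha$ is monotonically increasing in $\alpha \in (0,\infty)$. Hence, Part a) follows from (488).

b) From (488), for $\alpha \in (0,1)$, we have

$$\left(\tfrac{1}{\alpha} - 1\right) \mathscr{H}_\alpha(P\|Q) = D_{g_\alpha}(P\|Q) \tag{491}$$

where $g_\alpha \colon (0, \infty) \to \mathbb{R}$ is the convex function

$$g_\alpha(t) = t - 1 - \frac{t^\alpha - 1}{\alpha}, \quad t > 0 \tag{492}$$

with derivative

$$\frac{\partial}{\partial\alpha} g_\alpha(t) = -\frac{r(t^\alpha)}{\alpha^2 \log e} \le 0, \tag{493}$$

so $g_\alpha$ is monotonically decreasing in $\alpha \in (0,1)$. Hence, Part b) follows from (491).

c) To prove the log-convexity of $\frac{1}{\alpha}\, \mathscr{H}_\alpha(P\|Q)$ in $\alpha \in (0,\infty)$, we rely on [95, Theorem 2.1] which states that if $W$ is a non-negative random variable, then $\lambda_\alpha$ in (293) is log-convex in $\alpha$. The claim now follows from (293) by setting $W = \frac{\mathrm{d}P}{\mathrm{d}Q}(Y)$ with $Y \sim Q$, which yields that $\lambda_\alpha = \frac{1}{\alpha}\, \mathscr{H}_\alpha(P\|Q) \log e$ for $\alpha \in (0,\infty)$. ■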
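The three properties of the Hellinger divergence as a function of its order can be spot-checked on random finite-alphabet pairs. The following is a minimal sketch under the assumption $\mathscr{H}_\alpha(P\|Q) = \frac{1}{\alpha-1}\left(\sum_a P^\alpha(a)\, Q^{1-\alpha}(a) - 1\right)$ for $\alpha \ne 1$, which is what $D_{f_\alpha}$ reduces to on a finite alphabet; the order $\alpha = 1$ is skipped, and the helper names and grids are illustrative choices.

```python
import random


def hellinger(p, q, alpha):
    """Hellinger divergence of order alpha != 1 on a finite alphabet."""
    s = sum(pa ** alpha * qa ** (1.0 - alpha) for pa, qa in zip(p, q))
    return (s - 1.0) / (alpha - 1.0)


def random_pmf(n, rng):
    """A strictly positive probability vector of length n."""
    w = [rng.random() + 0.05 for _ in range(n)]
    s = sum(w)
    return [x / s for x in w]


def check_theorem_36(p, q, tol=1e-12):
    """Return three booleans: parts a), b) and inequality (487) on small grids."""
    grid = [0.1 * k for k in range(1, 40) if k != 10]  # orders in (0, 4), skipping 1
    h = [hellinger(p, q, a) for a in grid]
    increasing = all(x <= y + tol for x, y in zip(h, h[1:]))

    scaled = [(1.0 / a - 1.0) * hellinger(p, q, a) for a in grid if a < 1.0]
    decreasing = all(x + tol >= y for x, y in zip(scaled, scaled[1:]))

    log_convex = True
    for a in (0.3, 0.5, 2.5):
        for b in (0.7, 3.0, 4.1):  # chosen so that (a + b)/2 is never equal to 1
            lhs = hellinger(p, q, 0.5 * (a + b)) ** 2
            rhs = (a + b) ** 2 / (4.0 * a * b) * hellinger(p, q, a) * hellinger(p, q, b)
            log_convex = log_convex and lhs <= rhs + tol
    return increasing, decreasing, log_convex


if __name__ == "__main__":
    rng = random.Random(1)
    print(check_theorem_36(random_pmf(5, rng), random_pmf(5, rng)))
```

A grid check of this kind cannot prove the theorem, but a single `False` would flag a transcription error in the statement.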


B. Bounds as a Function of the Total Variation Distance 

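Just as with Pinsker’s inequality, for any $\varepsilon \in (0,2]$, the minimum value of $D_\alpha(P\|Q)$ compatible with $|P - Q| \ge \varepsilon$ is achieved with distributions on a binary alphabet [92, Proposition 1]:

$$\min_{P, Q \colon |P - Q| \ge \varepsilon} D_\alpha(P\|Q) = \min_{p, q \colon |p - q| \ge \frac{\varepsilon}{2}} d_\alpha(p\|q) \tag{494}$$

where the binary order-$\alpha$ Rényi divergence is defined as

$$d_\alpha(p\|q) = \begin{cases} \dfrac{1}{\alpha - 1} \log\left(p^\alpha q^{1-\alpha} + (1 - p)^\alpha (1 - q)^{1-\alpha}\right), & \text{if } \alpha \ne 1 \\[2mm] p \log\dfrac{p}{q} + (1 - p) \log\dfrac{1 - p}{1 - q}, & \text{if } \alpha = 1. \end{cases} \tag{495}$$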


We proceed to use Theorem 35 to get an upper bound on $D_\alpha(P\|Q)$ expressed in terms of $|P - Q|$.

Theorem 37: If $\beta_1 \in (0,1)$ and $\alpha \in (0,1) \cup (1, \infty)$, then

$$D_\alpha(P\|Q) \le \frac{1}{\alpha - 1} \log\left(1 + \frac{|P - Q|}{2} \cdot \frac{\beta_1^{1-\alpha} - 1}{1 - \beta_1}\right). \tag{496}$$

Proof: Regardless of whether $\alpha < 1$ or $\alpha > 1$, we can only get an upper bound if, in view of (470), in the integral in (471) we drop the interval $[\beta_2, 1]$:

$$D_\alpha(P\|Q) \le \frac{1}{\alpha - 1} \log\left(\beta_1^{1-\alpha} + (1 - \alpha) \int_1^{\beta_1^{-1}} \beta^{\alpha-2}\, \mathbb{F}_{P\|Q}(\log\beta)\, \mathrm{d}\beta\right)$$

$$= \frac{1}{\alpha - 1} \log\left(1 - (1 - \alpha) \int_1^{\beta_1^{-1}} \beta^{\alpha-2} \left(1 - \mathbb{F}_{P\|Q}(\log\beta)\right) \mathrm{d}\beta\right)$$

$$= \frac{1}{\alpha - 1} \log\left(1 + (\alpha - 1)\, \delta\, \mathbb{E}[W^\alpha]\right) \tag{497}$$

where (497) holds with $\delta = \tfrac12 |P - Q|$ and $W \sim p_1$ where $p_1$ is the probability density function supported on $[1, \beta_1^{-1}]$:

$$p_1(\beta) = \frac{1}{\delta \beta^2} \left(1 - \mathbb{F}_{P\|Q}(\log\beta)\right). \tag{498}$$

Note that $p_1$ is indeed a probability density function due to (251). In order to proceed, we derive an upper bound on $\mathbb{E}[W^\alpha]$ expressed in terms of $|P - Q|$ by invoking Lemma 5 in Appendix J. To that end, denote the monotonically increasing and non-negative function $g(x) = x^\alpha$ for $x > 0$, and let $p_2$ be the probability density function supported on $[1, \beta_1^{-1}]$:

$$p_2(\beta) = \frac{1}{1 - \beta_1} \cdot \frac{1}{\beta^2}. \tag{499}$$

Note that, on their support, $\beta^2 p_1(\beta)$ is monotonically decreasing while $\beta^2 p_2(\beta)$ is constant. Therefore, we can apply Lemma 5 to $W \sim p_1$ and $V \sim p_2$ to obtain

$$\mathbb{E}[W^\alpha] \le \mathbb{E}[V^\alpha] = \frac{\beta_1^{1-\alpha} - 1}{(1 - \beta_1)(\alpha - 1)} \tag{500}$$

which gives the desired result upon substituting in (497). ■

Corollary 7: If $\beta_1 \in (0,1)$ and $\alpha \in (0,1) \cup (1, \infty)$, then

$$\mathscr{H}_\alpha(P\|Q) \le \frac{\left(\beta_1^{1-\alpha} - 1\right) |P - Q|}{2(\alpha - 1)(1 - \beta_1)}. \tag{501}$$

Proof: Combining (80) and (496) yields (501). ■

Particularizing (501) to $\alpha = 2$ yields

$$\chi^2(P\|Q) \le \tfrac12\, \beta_1^{-1}\, |P - Q| \tag{502}$$

which improves the bound in (171) if either $\beta_1$ or $\beta_2 = 0$.

The combination of (1), (4) and (502) yields the following bound:

Corollary 8: If $\beta_1 > 0$, then

$$\chi^2(P\|Q) \le \frac{1}{\beta_1} \min\left\{\sqrt{\frac{D(P\|Q)}{2 \log e}},\ \sqrt{1 - \exp\left(-D(P\|Q)\right)}\right\}. \tag{503}$$


Example 14: Let $P$, $Q$ be defined on $\mathcal{A} = \{0,1\}$ with $P(0) = Q(1) = 0.99$, which implies that $\beta_1 = \frac{1}{99}$. Then $\chi^2(P\|Q) = 97.01$, and the bound in (503) is equal to 98.45 in contrast to the upper bound in (181) whose value is 121.17.
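The figures quoted in Example 14 can be reproduced with a few lines of Python. This sketch makes explicit assumptions: the binary pair is taken as $P(0) = Q(1) = 0.99$ (a choice consistent with the reported $\chi^2 = 97.01$); natural logarithms are used; and the three bounds are read as $D_\alpha \le \frac{1}{\alpha-1}\log\left(1 + \delta\, \frac{\beta_1^{1-\alpha} - 1}{1 - \beta_1}\right)$ with $\delta = \frac12|P-Q|$, $\chi^2 \le \frac12 \beta_1^{-1} |P - Q|$, and $\chi^2 \le \beta_1^{-1} \min\{\sqrt{D/2}, \sqrt{1 - e^{-D}}\}$. Function names are illustrative.

```python
import math

P = [0.99, 0.01]  # assumed example: P(0) = Q(1) = 0.99
Q = [0.01, 0.99]


def chi_square(p, q):
    return sum((pa - qa) ** 2 / qa for pa, qa in zip(p, q))


def kl_nats(p, q):
    return sum(pa * math.log(pa / qa) for pa, qa in zip(p, q) if pa > 0.0)


def renyi_nats(p, q, alpha):
    s = sum(pa ** alpha * qa ** (1.0 - alpha) for pa, qa in zip(p, q))
    return math.log(s) / (alpha - 1.0)


def beta_1(p, q):
    """Reciprocal of the largest likelihood ratio P(a)/Q(a)."""
    return 1.0 / max(pa / qa for pa, qa in zip(p, q))


def half_tv(p, q):
    """delta = |P - Q| / 2."""
    return 0.5 * sum(abs(pa - qa) for pa, qa in zip(p, q))


def renyi_tv_bound(p, q, alpha):
    """Upper bound on D_alpha in terms of |P - Q| and beta_1, as in (496)."""
    b1 = beta_1(p, q)
    arg = 1.0 + half_tv(p, q) * (b1 ** (1.0 - alpha) - 1.0) / (1.0 - b1)
    return math.log(arg) / (alpha - 1.0)


def chi2_tv_bound(p, q):
    """Upper bound on chi^2 in terms of |P - Q| and beta_1, as in (502)."""
    return half_tv(p, q) / beta_1(p, q)


def chi2_kl_bound(p, q):
    """Upper bound on chi^2 in terms of D(P||Q) and beta_1, as in (503)."""
    d = kl_nats(p, q)
    return min(math.sqrt(d / 2.0), math.sqrt(1.0 - math.exp(-d))) / beta_1(p, q)


if __name__ == "__main__":
    print(round(chi_square(P, Q), 2), round(chi2_tv_bound(P, Q), 2),
          round(chi2_kl_bound(P, Q), 2))
    for alpha in (0.5, 2.0, 5.0):
        print(alpha, renyi_nats(P, Q, alpha), renyi_tv_bound(P, Q, alpha))
```

For this pair the total-variation bound on $\chi^2$ evaluates to $0.98 \times 99 = 97.02$, which is within $0.01$ of the exact value, while the relative-entropy bound is slightly looser at about $98.45$.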

Remark 47: By letting $\alpha \to \infty$ in (496), we obtain $D_\infty(P\|Q) \le \log\frac{1}{\beta_1}$, which shows that the bound in (496) is asymptotically tight since $D_\infty(P\|Q) = \log\frac{1}{\beta_1}$.

Remark 48: By letting $\alpha \to 1$ in (496), we get (344). Therefore, Theorem 37 generalizes [109, Theorem 7].

Remark 49: By letting $\alpha \to 0$, it follows from (496) that

$$D_0(P\|Q) \le \log\left(\frac{1}{1 - \tfrac12 |P - Q|}\right), \tag{504}$$

a bound which, in view of (77), is achieved with equality in the case of a finite alphabet with

$$P(a) = \begin{cases} \dfrac{Q(a)}{1 - \delta}, & a \in F \\[1mm] 0, & a \in F^c \end{cases} \tag{505}$$

with the event $F$ selected to satisfy $Q(F) = 1 - \delta$.

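Remark 50: Another upper bound on the Rényi divergence can be obtained by the simpler bound

$$\mathbb{E}[W^\alpha] \le \beta_1^{-\alpha}, \tag{506}$$

which holds because $W \in [1, \beta_1^{-1}]$. Combining (497) and (506) yields

$$D_\alpha(P\|Q) \le \frac{1}{\alpha - 1} \log\left(1 + (\alpha - 1)\, \delta\, \beta_1^{-\alpha}\right). \tag{507}$$

Note that, in the limit $\alpha \to 0$, the bounds in (496) and (507) coincide and are equal to the tight bound $-\log(1 - \delta)$.

Remark 51: Alternatively, we have the bound

$$D_\alpha(P\|Q) \le \frac{1}{\alpha - 1} \log\left((1 - \delta)^{1-\alpha} + \delta(\alpha - 1) \int_{\frac{1}{1-\delta}}^{\beta_1^{-1}} \frac{\beta^{\alpha-1}}{\beta - 1}\, \mathrm{d}\beta\right) \tag{508}$$

obtained from (497) and

$$\mathbb{E}[W^\alpha] \le \frac{(1 - \delta)^{1-\alpha} - 1}{\delta(\alpha - 1)} + \int_{\frac{1}{1-\delta}}^{\beta_1^{-1}} \frac{\beta^{\alpha-1}}{\beta - 1}\, \mathrm{d}\beta \tag{509}$$

which holds since, in view of (498) and (273),

$$p_1(\beta) \le \begin{cases} \dfrac{1}{\delta \beta^2}, & \beta \in \left[1, \frac{1}{1-\delta}\right] \\[2mm] \dfrac{1}{\beta(\beta - 1)}, & \beta \in \left[\frac{1}{1-\delta}, \beta_1^{-1}\right] \\[2mm] 0, & \text{otherwise.} \end{cases} \tag{510}$$

The upper bounds in (496) and (508) asymptotically coincide in the limit where $\alpha \to \infty$, giving the common limit of $\log\left(\frac{1}{\beta_1}\right)$ which is a tight upper bound (cf. Remark 47).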


C. Bounds as a Function of the Relative Entropy 

In this section, we provide upper and lower bounds on the Rényi divergence $D_\alpha(P\|Q)$, for an arbitrary order $\alpha \in (0,1) \cup (1, \infty)$, expressed in terms of the relative entropy $D(P\|Q)$ and $\beta_1$, $\beta_2$.

Theorem 38: Let $(\beta_1, \beta_2) \in [0,1)^2$, $\alpha \in (0,1) \cup (1,\infty)$, and $u_\alpha \colon [0, \infty] \to [0, \infty]$ be

$$u_\alpha(t) = \frac{\alpha - 1}{\kappa_\alpha(t)} \tag{511}$$

with $\kappa_\alpha$ defined in (176).


December 6, 2016 


DRAFT 






























SASON AND VERDU: /-DIVERGENCE INEQUALITIES 


72 


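a) If $\alpha \in (0,1)$, then

$$\frac{1}{\alpha - 1} \log\left(1 + u_\alpha(\beta_1^{-1})\, D(P\|Q)\right) \tag{512}$$

$$\le D_\alpha(P\|Q)$$

$$\le \min\left\{D(P\|Q),\ \frac{1}{\alpha - 1} \log\left(1 + u_\alpha(\beta_2)\, D(P\|Q)\right)\right\}. \tag{513}$$

b) If $\alpha \in (1, \infty)$, then

$$\max\left\{D(P\|Q),\ \frac{1}{\alpha - 1} \log\left(1 + u_\alpha(\beta_2)\, D(P\|Q)\right)\right\} \tag{514}$$

$$\le D_\alpha(P\|Q)$$

$$\le \min\left\{\log\frac{1}{\beta_1},\ \frac{1}{\alpha - 1} \log\left(1 + u_\alpha(\beta_1^{-1})\, D(P\|Q)\right)\right\}. \tag{515}$$

c) Furthermore, if $\alpha \in (0,1) \cup (1,\infty)$, then

$$D_\alpha(P\|Q) \le \frac{1}{\alpha - 1} \log\left(1 + \frac{\bar{\delta}\left(\beta_1^{1-\alpha} - 1\right)}{1 - \beta_1}\right) \tag{516}$$

where

$$\bar{\delta}^2 = \min\left\{\frac{D(P\|Q)}{2 \log e},\ 1 - \exp\left(-D(P\|Q)\right)\right\}. \tag{517}$$

Proof: Parts a) and b) follow from Theorem 9 in view of (80), from the fact that $D_\alpha(P\|Q)$ is monotonically increasing in $\alpha > 0$, and from $D_\infty(P\|Q) = \log\frac{1}{\beta_1}$.

Part c) follows from Theorem 37 by replacing $\delta = \tfrac12 |P - Q|$ by its upper bound $\bar{\delta}$ obtained from (1) and (4). ■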

The next three remarks address the tightness of the bounds (512)–(513) and (514)–(515).

Remark 52: The constants $u_\alpha(\beta_2)$ and $u_\alpha(\beta_1^{-1})$ in (512)–(513) and (514)–(515) are the best possible among all probability measures $P$, $Q$ with given $(\beta_1, \beta_2) \in [0,1)^2$. This follows from (80), and in view of the tightness of the constants in Theorem 9 (Remark 11).

Remark 53: Let P, Q = Q e be defined on a binary alphabet with P( 0) = ^ and Q(0) = 1 — e. Then, it is 
easy to verify that the ratio of the upper to lower bounds in Parts [a]) and |b| converges to 1 as e —> 0. 

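Remark 54: Let $P$ and $Q$ be defined on a binary alphabet with $P(0) = \tfrac12$, $Q(0) = \varepsilon \in (0,1)$. Then, in the limit $\varepsilon \to 0$, the ratio of $D_\alpha(P\|Q)$ to the left side of (512) is equal to $\frac{\alpha}{\log_2\left(\frac{2}{2-\alpha}\right)} \in (1, \log_e 4)$ for $\alpha \in (0,1)$. Moreover, if $\varepsilon \to 0$, the ratio of $D_\alpha(P\|Q)$ and the right side of (515) tends to 1 for $\alpha \in (1, \infty)$.

To prove the first part of Remark 54 in view of (512), one needs to show that for $\alpha \in (0,1)$

$$\lim_{\varepsilon \downarrow 0} \frac{(\alpha - 1)\, d_\alpha\left(\tfrac12 \| \varepsilon\right)}{\log\left(1 + u_\alpha\left(\tfrac{1}{2\varepsilon}\right) d\left(\tfrac12 \| \varepsilon\right)\right)} = \frac{\alpha}{\log_2\left(\frac{2}{2-\alpha}\right)} \tag{518}$$

where $d(\cdot\|\cdot) = d_1(\cdot\|\cdot)$ and $d_\alpha(\cdot\|\cdot)$ are given in (495). This can be verified by using (495) and (511) to show that if $\alpha \in (0,1)$, then in the limit $\varepsilon \to 0$

$$d\left(\tfrac12 \| \varepsilon\right) = \tfrac12 \left(1 + o(1)\right) \log\left(\tfrac{1}{\varepsilon}\right), \tag{519}$$

$$(\alpha - 1)\, d_\alpha\left(\tfrac12 \| \varepsilon\right) = -\alpha \left(1 + o(1)\right) \log 2, \tag{520}$$

$$u_\alpha\left(\tfrac{1}{2\varepsilon}\right) = -\frac{\alpha \left(1 + o(1)\right)}{\log\left(\frac{1}{\varepsilon}\right)}. \tag{521}$$

Assembling (519)–(521) yields (518) whose right side is monotonically decreasing in $\alpha \in (0,1)$, and bounded between 1 (by letting $\alpha \to 1$) and $\log_e 4$ (by letting $\alpha \to 0$).

To prove the second part of Remark 54 in view of (515), one needs to show that for $\alpha \in (1,\infty)$

$$\lim_{\varepsilon \downarrow 0} \frac{(\alpha - 1)\, d_\alpha\left(\tfrac12 \| \varepsilon\right)}{\log\left(1 + u_\alpha\left(\tfrac{1}{2\varepsilon}\right) d\left(\tfrac12 \| \varepsilon\right)\right)} = 1. \tag{522}$$

From (495) and (511), it follows that in the limit where $\varepsilon$ tends to zero

$$(\alpha - 1)\, d_\alpha\left(\tfrac12 \| \varepsilon\right) = \log\left(2^{-\alpha} \varepsilon^{1-\alpha}\right) \left(1 + o(1)\right), \tag{523}$$

$$u_\alpha\left(\tfrac{1}{2\varepsilon}\right) = \frac{1 + o(1)}{(2\varepsilon)^{\alpha-1} \log\left(\frac{1}{2\varepsilon}\right)}. \tag{524}$$

Assembling (519), (523) and (524) yields (522) for $\alpha \in (1,\infty)$.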



Fig. 4. The Rényi divergence $D_\alpha(P\|Q)$ for $\mathcal{A} = \{0,1\}$ with $P(0) = Q(1) = 0.65$, compared to the tightest upper bound in Theorem 38: (a) (513) for $\alpha \in (0,1)$; (b) (515) for $\alpha \in [1, 2.57]$; (c) (516) for $\alpha \ge 2.57$.

Example 15: Figure 4 illustrates the upper bounds on $D_\alpha(P\|Q)$ in Theorem 38 for binary alphabets.




Remark 55: 11401 Proposition 11] shows an upper bound on D a (P\\Q) for a E [1, |], which is expressed in 
terms of D(P\\Q) and the finite cardinalities of the alphabets over which P and Q are defined. Although the 
bound in ROl (9)] is not tight, it leads to a strong converse for a certain class of discrete memoryless networks. 


IX. Summary 


Since many distance measures of interest fall under the common paradigm of an $f$-divergence, it is not surprising that bounds on the ratios of various $f$-divergences are useful in many instances such as proving convergence
of probability measures according to various metrics, analysis of rates of convergence and concentration of 
measure bounds lfl3l . IBP . f76l . lf82l . |l89ll , HI 1 111 , hypothesis testing lOTl . testing goodness of fit 11501 . Il85ll . 
minimax risk in estimation and modeling B71 . 0T1 . ll86l . Il07t . strong data processing inequality constants and 
maximal correlation Jl|, l lETl . | |84l . transportation-cost inequalities | [T3l l. | [74l . ll82l . li83l . contiguity l64l . lf65ll . 
etc. 

While the derivation of $f$-divergence inequalities has received considerable attention in the literature, the proof techniques have been tailored to the specific instances. In contrast, we have proposed several systematic approaches to the derivation of $f$-divergence inequalities. Introduced in Section III-A, functional domination emerges as a basic tool to obtain $f$-divergence inequalities. Another basic tool that capitalizes on many cases of interest (including the finite alphabet one) is introduced in Section IV-B where not only one of the distributions is absolutely continuous with respect to the other but their relative information is almost surely bounded. Section V-D illustrates the use of moment inequalities and the log-convexity property, while the utility of Lipschitz constraints in deriving bounds is highlighted in Section VI-B.

In addition, new $f$-divergence inequalities (frequently with optimal constants) arise from:

• integral representation of $f$-divergences, expressed in terms of the $E_\gamma$ divergence (Section VII-B);

• extension of Pinsker’s inequality to $E_\gamma$ divergence (Section VII-C);

• a relation between the relative information and the relative entropy (Section VII-D);

• exact expressions of Rényi divergence in terms of the relative information spectrum (Section VIII-A);

• the exact locus of the entropy and the variational distance from the equiprobable probability mass function (Section VI-D).

Appendix A

Completion of the Proof of Theorem 7

Lemma 2: The function $\kappa\colon (0,\infty) \to (0,\infty)$, which is the continuous extension of the function in (172) with $\kappa(1) = 1$, is strictly monotonically increasing.

Proof: From (172) and (174),

$$\kappa'(t) = \frac{(t-1)^2 \log^2 e - t \log^2 t}{t\, g^2(t)} \tag{525}$$

if $t \in (0,1) \cup (1,\infty)$, while $\kappa'(1) = \tfrac{1}{3}$. To show

$$(t-1)^2 \log^2 e - t \log^2 t > 0, \quad \forall\, t \in (0,1) \cup (1,\infty), \tag{526}$$

we substitute $t = \exp(x)$ to obtain that, if $x \neq 0$, then

$$s(x) = \exp(2x) - (2 + x^2)\exp(x) + 1 > 0, \tag{527}$$

which holds since $s(0) = 0$ and the derivative

$$s'(x) = 2\exp(x)\left(\exp(x) - \Bigl(1 + x + \frac{x^2}{2}\Bigr)\right) \tag{528}$$

is negative on $(-\infty,0)$ and positive on $(0,\infty)$.

Appendix B

Proof of the Monotonicity of $\kappa_\alpha$ in (176)

To show that the function $\kappa_\alpha\colon [0,\infty] \to [0,\infty]$ in (176) is monotonically increasing on $[0,\infty]$ if $\alpha \in (0,1)$ and monotonically decreasing on $[0,\infty]$ if $\alpha \in (1,\infty)$, it is sufficient to show that

$$\frac{\mathrm{d}}{\mathrm{d}t}\left(\frac{r(t)}{1 - t^\alpha + \alpha(t-1)}\right) > 0 \tag{529}$$

for $(\alpha,t) \in \mathcal{T} = \bigl((0,1)\cup(1,\infty)\bigr)^2$. From (43), straightforward calculus gives

$$\bigl[1 - t^\alpha + \alpha(t-1)\bigr]^2 \, \frac{\mathrm{d}}{\mathrm{d}t}\left(\frac{r(t)}{1 - t^\alpha + \alpha(t-1)}\right) \tag{530}$$

$$= (1-\alpha)(1-t^\alpha)\log t - \alpha(1-t)(1-t^{\alpha-1})\log e \tag{531}$$

$$\triangleq g_\alpha(t) \tag{532}$$

so the desired result will follow upon showing

$$g_\alpha(t) > 0, \quad (\alpha,t) \in \mathcal{T}. \tag{533}$$

Note that $g_\alpha(1) = 0$. For $(\alpha,t) \in \mathcal{T}$, it is easy to verify that

$$(1-\alpha)(t-1)(1-t^{\alpha-1}) > 0. \tag{534}$$

A division of (533) by the positive left side of (534) gives the following equivalent inequality:

$$\phi_\alpha(t) > \frac{\alpha}{\alpha-1}, \quad \forall\, (\alpha,t) \in \mathcal{T} \tag{535}$$
where, for $\alpha \in (0,1)\cup(1,\infty)$,

$$\phi_\alpha(t) = \begin{cases} \dfrac{(1-t^\alpha)\log_e t}{(t-1)(1-t^{\alpha-1})} & \text{if } t \in (0,1)\cup(1,\infty) \\[2ex] \dfrac{\alpha}{\alpha-1} & \text{if } t = 1. \end{cases} \tag{536}$$

We aim to prove (535). Note that, for $\alpha \in (0,1)\cup(1,\infty)$,

$$\lim_{t \to 1} \phi_\alpha(t) = \frac{\alpha}{\alpha-1} = \phi_\alpha(1) \tag{537}$$

so (535) is implied by proving that $\phi_\alpha\colon (0,\infty) \to \mathbb{R}$ is monotonically decreasing on $(0,1)$ and monotonically increasing on $(1,\infty)$. For this purpose, we rely on the following lemmas.

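Lemma 3: For every $t > 0$ and $\alpha \in (0,1)\cup(1,\infty)$

$$\phi_\alpha(t) = \psi(t) - \frac{\psi(t^{1-\alpha})}{1-\alpha} \tag{538}$$

where

$$\psi(t) = \begin{cases} \dfrac{\log_e t}{t-1} & \text{if } t \in (0,1)\cup(1,\infty) \\[2ex] 1 & \text{if } t = 1. \end{cases} \tag{539}$$

Proof: For $\alpha, t \in (0,1)\cup(1,\infty)$

$$\psi(t) - \frac{\psi(t^{1-\alpha})}{1-\alpha} = \frac{(t^{1-\alpha} - t)\log_e t}{(t-1)(t^{1-\alpha} - 1)} \tag{540}$$

$$= \frac{(1 - t^{\alpha})\log_e t}{(t-1)(1 - t^{\alpha-1})} \tag{541}$$

$$= \phi_\alpha(t) \tag{542}$$

where (540) holds due to (539), (541) is justified by multiplying the numerator and denominator of (540) by $t^{\alpha-1}$, and (542) is due to (536). Note that (538) is also satisfied at $t = 1$ due to the continuity of $\phi_\alpha$ and $\psi$ at this point. ■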

Lemma 4: The following inequality holds for $z > 0$:

$$\psi'(z) + z\,\psi''(z) > 0. \tag{543}$$

Proof: From the power series expansion of $\log_e z$ around $z = 1$, we get for $0 < z < 2$

$$\psi(z) = 1 - \tfrac{1}{2}(z-1) + \tfrac{1}{3}(z-1)^2 - \tfrac{1}{4}(z-1)^3 + \cdots \tag{544}$$

From (539) and (544), $\psi'(1) = -\tfrac{1}{2}$ and $\psi''(1) = \tfrac{2}{3}$; at $z = 1$, the left side of (543) is equal to

$$\psi'(1) + \psi''(1) = \tfrac{1}{6} > 0. \tag{545}$$

For $z \in (0,1)\cup(1,\infty)$, the left side of (543) satisfies

$$\psi'(z) + z\,\psi''(z) = \frac{m(z)}{(z-1)^3} \tag{546}$$

where

$$m(z) = (z+1)\log_e z - 2(z-1), \quad \forall\, z > 0. \tag{547}$$

From (547)

$$m(1) = 0, \tag{548}$$

$$m'(z) = \frac{1}{z} - 1 - \log_e \frac{1}{z} > 0, \quad \forall\, z \in (0,1)\cup(1,\infty) \tag{549}$$

which implies that $m\colon (0,\infty) \to \mathbb{R}$ is monotonically increasing, positive for $z > 1$ and negative for $z < 1$. Hence $m(z)$ and $(z-1)^3$ have the same sign for $z \neq 1$. These facts together with (545) and (546) yield that (543) holds for all $z > 0$.
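Lemma 4 lends itself to a quick numerical sanity check. The sketch below is illustrative only: it estimates $\psi'(z) + z\psi''(z)$ by central finite differences for $\psi(z) = \log_e z/(z-1)$ and verifies positivity at a few points, together with the sign pattern of $m(z)$. The step size `h`, the sample points, and the helper names are assumptions of this example.

```python
import math

def psi(z):
    """psi(z) = ln(z)/(z-1), extended by continuity with psi(1) = 1."""
    return 1.0 if z == 1 else math.log(z) / (z - 1)

def m(z):
    """m(z) = (z+1) ln z - 2(z-1), as in (547)."""
    return (z + 1) * math.log(z) - 2 * (z - 1)

def lemma4_lhs(z, h=1e-3):
    """Central-difference estimate of psi'(z) + z * psi''(z)."""
    d1 = (psi(z + h) - psi(z - h)) / (2 * h)
    d2 = (psi(z + h) - 2 * psi(z) + psi(z - h)) / (h * h)
    return d1 + z * d2

# Sample points on both sides of z = 1 (z = 1 itself is excluded).
zs = [0.2, 0.5, 0.8, 1.5, 2.0, 5.0, 20.0]
lhs_positive = all(lemma4_lhs(z) > 0 for z in zs)
m_sign_ok = all((m(z) > 0) == (z > 1) for z in zs)

print(lhs_positive, m_sign_ok)
```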


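We proceed now with the proof of (535). Let $\phi'_\alpha(t)$ denote the derivative of $\phi_\alpha(t)$ with respect to $t$. For $\alpha \in (0,1)\cup(1,\infty)$, we have

$$\frac{\partial}{\partial\alpha}\,\phi'_\alpha(t) = \frac{\partial^2}{\partial\alpha\,\partial t}\left(\psi(t) - \frac{\psi(t^{1-\alpha})}{1-\alpha}\right) \tag{550}$$

$$= \frac{\partial}{\partial\alpha}\Bigl(\psi'(t) - t^{-\alpha}\,\psi'(t^{1-\alpha})\Bigr) \tag{551}$$

$$= -\frac{\partial}{\partial\alpha}\Bigl(t^{-\alpha}\,\psi'(t^{1-\alpha})\Bigr) \tag{552}$$

$$= t^{-\alpha}\log_e(t)\,\psi'(t^{1-\alpha}) - t^{-\alpha}\,\frac{\partial}{\partial\alpha}\,\psi'(t^{1-\alpha}) \tag{553}$$

$$= t^{-\alpha}\log_e(t)\,\Bigl(\psi'(t^{1-\alpha}) + t^{1-\alpha}\,\psi''(t^{1-\alpha})\Bigr) \tag{554}$$

where (550) follows from (538). From Lemma 4, for $t > 0$,

$$\psi'(t^{1-\alpha}) + t^{1-\alpha}\,\psi''(t^{1-\alpha}) > 0. \tag{555}$$

From (554) and (555), for $\alpha \in (0,1)\cup(1,\infty)$ and $t \in (0,1)$

$$\frac{\partial}{\partial\alpha}\,\phi'_\alpha(t) < 0, \tag{556}$$

and for $t \in (1,\infty)$,

$$\frac{\partial}{\partial\alpha}\,\phi'_\alpha(t) > 0. \tag{557}$$

From (538) and the continuity of $\psi'$ on $(0,\infty)$ (see (539) and (544)), for all $t > 0$,

$$\lim_{\alpha \downarrow 0} \phi'_\alpha(t) = \lim_{\alpha \downarrow 0} \bigl(\psi'(t) - t^{-\alpha}\,\psi'(t^{1-\alpha})\bigr) = 0. \tag{558}$$

Combining (556) and (558) gives that, for $\alpha \in (0,1)$ and $t \in (0,1)$

$$\phi'_\alpha(t) < 0, \tag{559}$$

and, combining (557) and (558) gives that, for $\alpha \in (0,1)$ and $t \in (1,\infty)$,

$$\phi'_\alpha(t) > 0. \tag{560}$$

Hence, for $\alpha \in (0,1)$, $\phi_\alpha\colon (0,\infty) \to \mathbb{R}$ is monotonically decreasing on $(0,1)$, and it is monotonically increasing on $(1,\infty)$.

We now consider the case where $\alpha \in (1,\infty)$. Since, from (538) and (544),

$$\lim_{\alpha \to 1} \phi'_\alpha(t) = \lim_{\alpha \to 1} \bigl(\psi'(t) - t^{-\alpha}\,\psi'(t^{1-\alpha})\bigr) \tag{561}$$

$$= \psi'(t) + \frac{1}{2t} \tag{562}$$

then, the existence of this limit in (561) yields that its one-sided limits are equal, i.e.,

$$\lim_{\alpha \downarrow 1} \phi'_\alpha(t) = \lim_{\alpha \to 1} \phi'_\alpha(t) \tag{563}$$

$$= \lim_{\alpha \uparrow 1} \phi'_\alpha(t). \tag{564}$$

Consequently, (559), (560) and (564) yield that

$$\lim_{\alpha \downarrow 1} \phi'_\alpha(t) \le 0, \quad \forall\, t \in (0,1), \qquad \lim_{\alpha \downarrow 1} \phi'_\alpha(t) \ge 0, \quad \forall\, t \in (1,\infty) \tag{565}$$

and, from (556), (557) and (565), we conclude that (559) and (560) also hold for $\alpha \in (1,\infty)$. The property that $\phi_\alpha\colon (0,\infty) \to \mathbb{R}$ is monotonically decreasing on $(0,1)$ and monotonically increasing on $(1,\infty)$ is therefore extended also to $\alpha \in (1,\infty)$. As explained after (537), this implies that (535) holds. Consequently, (533) also holds, which implies that $\kappa_\alpha\colon [0,\infty] \to [0,\infty]$, defined in (176), is monotonically increasing for all $\alpha \in (0,1)$, and it is monotonically decreasing for all $\alpha \in (1,\infty)$.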
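The two inequalities that drive this appendix, $g_\alpha(t) > 0$ in (533) and $\phi_\alpha(t) > \alpha/(\alpha-1)$ in (535), can be checked on a small grid. This is a minimal sketch under the assumption that logarithms are natural (so $\log e = 1$); the grid values and the helper names `g_alpha` and `phi_alpha` are chosen for illustration only.

```python
import math

def g_alpha(alpha, t):
    """g_alpha(t) from (531)-(532), with natural logarithms."""
    return ((1 - alpha) * (1 - t ** alpha) * math.log(t)
            - alpha * (1 - t) * (1 - t ** (alpha - 1)))

def phi_alpha(alpha, t):
    """phi_alpha(t) from (536), for t != 1."""
    return ((1 - t ** alpha) * math.log(t)
            / ((t - 1) * (1 - t ** (alpha - 1))))

# Orders on both sides of alpha = 1, arguments on both sides of t = 1.
alphas = [0.25, 0.5, 0.75, 1.5, 2.0, 3.0]
ts = [0.1, 0.3, 0.6, 0.9, 1.1, 1.5, 3.0, 8.0]

g_positive = all(g_alpha(a, t) > 0 for a in alphas for t in ts)
phi_above_limit = all(phi_alpha(a, t) > a / (a - 1)
                      for a in alphas for t in ts)

print(g_positive, phi_above_limit)
```

Near $t = 1$ the margin in both inequalities is small, since $g_\alpha(1) = 0$, so the grid deliberately stays a short distance away from that point.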
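Appendix C

Proof of (214)

To verify that (214) follows from (206), fix arbitrarily small $\varepsilon > 0$ and $\rho > 0$. Consider the partition $\mathcal{A} = \mathcal{A}_1^{(n)} \cup \mathcal{A}_2^{(n)} \cup \mathcal{A}_3^{(n)}$ with

$$\mathcal{A}_1^{(n)} \triangleq \Bigl\{ a \in \mathcal{A} \colon \frac{\mathrm{d}P_n}{\mathrm{d}Q}(a) \in [0, 1-\rho] \Bigr\}, \tag{566}$$

$$\mathcal{A}_2^{(n)} \triangleq \Bigl\{ a \in \mathcal{A} \colon \frac{\mathrm{d}P_n}{\mathrm{d}Q}(a) \in (1-\rho, 1+\varepsilon] \Bigr\}, \tag{567}$$

$$\mathcal{A}_3^{(n)} \triangleq \Bigl\{ a \in \mathcal{A} \colon \frac{\mathrm{d}P_n}{\mathrm{d}Q}(a) \in (1+\varepsilon, \infty) \Bigr\}, \tag{568}$$

then

$$I_1^{(n)} + I_2^{(n)} + I_3^{(n)} = 1 \tag{569}$$

where

$$I_j^{(n)} \triangleq \int_{\mathcal{A}_j^{(n)}} \frac{\mathrm{d}P_n}{\mathrm{d}Q}(a)\, \mathrm{d}Q(a). \tag{570}$$

From the assumption in (206), $I_3^{(n)} \to 0$ when $n \to \infty$ since for all sufficiently large $n$

$$Q\bigl(\mathcal{A}_3^{(n)}\bigr) = 0. \tag{571}$$

Let

$$d_n = Q\bigl(\mathcal{A}_1^{(n)}\bigr) \tag{572}$$

then, from (571), for all sufficiently large $n$

$$1 - d_n = Q\bigl(\mathcal{A}_2^{(n)}\bigr). \tag{573}$$

Consequently, from (566), (567), (569), (571), (572) and (573), it follows that for all sufficiently large $n$,

$$1 = I_1^{(n)} + I_2^{(n)} \tag{574}$$

$$\le d_n (1-\rho) + (1-d_n)(1+\varepsilon) \triangleq \mu_n. \tag{575}$$

If $\liminf d_n = 0$ for an arbitrarily small $\rho > 0$ then (214) holds by the definition in (512). Assuming otherwise, namely,

$$\liminf d_n = \delta \in (0,1) \tag{576}$$

leads to the following contradiction:

$$1 \le \liminf \mu_n \tag{577}$$

$$\le (1-\rho) \liminf d_n + (1+\varepsilon) \limsup\, (1-d_n) \tag{578}$$

$$= \delta(1-\rho) + (1-\delta)(1+\varepsilon) \tag{579}$$

$$< 1 \tag{580}$$

where (577) follows from (574), (575); (578) holds by (575); (579) is due to (576); and (580) holds once $\varepsilon$ is taken small enough that $(1-\delta)\,\varepsilon < \delta\rho$.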

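Appendix D

Proof of Theorem 15

Eq. (243) follows from the definitions in (26) and (57). Since $z^+ = \tfrac{1}{2}(|z| + z)$ and $z^- = \tfrac{1}{2}(|z| - z)$, for all $z \in \mathbb{R}$, (244) and (245) follow from (243) and

$$\mathbb{E}\bigl[1 - \exp(\imath_{P\|Q}(Y))\bigr] = 1 - \int \frac{\mathrm{d}P}{\mathrm{d}Q}\, \mathrm{d}Q = 0. \tag{581}$$

By change of measure, for every measurable function $f\colon \mathcal{A} \to \mathbb{R}$ with $\mathbb{E}[f(X)] < \infty$ and $\mathbb{E}[f(Y)] < \infty$,

$$\mathbb{E}[f(X)] = \mathbb{E}\Bigl[\frac{\mathrm{d}P}{\mathrm{d}Q}(Y)\, f(Y)\Bigr] = \mathbb{E}\bigl[\exp(\imath_{P\|Q}(Y))\, f(Y)\bigr]. \tag{582}$$

Hence, it follows from (582) that

$$\mathbb{P}[\imath_{P\|Q}(X) > 0] = \mathbb{E}\bigl[1\{\imath_{P\|Q}(X) > 0\}\bigr] \tag{583}$$

$$= \mathbb{E}\bigl[\exp(\imath_{P\|Q}(Y))\, 1\{\imath_{P\|Q}(Y) > 0\}\bigr] \tag{584}$$

and

$$\mathbb{P}[\imath_{P\|Q}(X) < 0] = \mathbb{E}\bigl[1\{\imath_{P\|Q}(X) < 0\}\bigr] \tag{585}$$

$$= \mathbb{E}\bigl[\exp(\imath_{P\|Q}(Y))\, 1\{\imath_{P\|Q}(Y) < 0\}\bigr]. \tag{586}$$

To show (246), note that from (245) and the change of measure in (582), we get

$$\tfrac{1}{2}|P - Q| = \mathbb{E}\bigl[(1 - \exp(\imath_{P\|Q}(Y)))^-\bigr] \tag{587}$$

$$= \mathbb{E}\bigl[(\exp(\imath_{P\|Q}(Y)) - 1)\, 1\{\imath_{P\|Q}(Y) > 0\}\bigr] \tag{588}$$

$$= \mathbb{E}\bigl[(1 - \exp(-\imath_{P\|Q}(X)))\, 1\{\imath_{P\|Q}(X) > 0\}\bigr] \tag{589}$$

$$= \mathbb{E}\bigl[(1 - \exp(-\imath_{P\|Q}(X)))^+\bigr]. \tag{590}$$

To show (247) and (248), we get from (245) and (584)

$$\tfrac{1}{2}|P - Q| = \mathbb{E}\bigl[(1 - \exp(\imath_{P\|Q}(Y)))^-\bigr] \tag{591}$$

$$= \mathbb{E}\bigl[(\exp(\imath_{P\|Q}(Y)) - 1)\, 1\{\imath_{P\|Q}(Y) > 0\}\bigr] \tag{592}$$

$$= \mathbb{P}[\imath_{P\|Q}(X) > 0] - \mathbb{P}[\imath_{P\|Q}(Y) > 0] \tag{593}$$

where (593) is (247), and (248) is equivalent to (247).

To show (249), we use (245) and the notation in (32) in order to write

$$\tfrac{1}{2}|P - Q| = \mathbb{E}\bigl[(1 - Z)^-\bigr] \tag{594}$$

$$= \mathbb{E}\bigl[(Z - 1)\, 1\{Z > 1\}\bigr] \tag{595}$$

$$= \int_0^\infty \mathbb{P}\bigl[(Z - 1)\, 1\{Z > 1\} > \beta\bigr]\, \mathrm{d}\beta \tag{596}$$

$$= \int_1^\infty \mathbb{P}[Z > \beta]\, \mathrm{d}\beta \tag{597}$$

$$= \int_0^1 \mathbb{P}[Z < \beta]\, \mathrm{d}\beta \tag{598}$$

where (594) follows from (245) with $Z$ in (32); (596) exploits the fact that the expectation of a non-negative random variable is the integral of its complementary cumulative distribution function; and (598) is satisfied since $Z$ is non-negative with $\mathbb{E}[Z] = 1$.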

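To show (250), we use (246) to write

$$\tfrac{1}{2}|P - Q| = \mathbb{E}\bigl[(1 - \exp(-\imath_{P\|Q}(X)))^+\bigr]$$

$$= \int_0^\infty \mathbb{P}\bigl[(1 - \exp(-\imath_{P\|Q}(X)))^+ > \beta\bigr]\, \mathrm{d}\beta \tag{599}$$

$$= \int_0^1 \mathbb{P}\bigl[(1 - \exp(-\imath_{P\|Q}(X)))^+ > \beta\bigr]\, \mathrm{d}\beta \tag{600}$$

$$= \int_0^1 \mathbb{P}\Bigl[\imath_{P\|Q}(X) > \log \frac{1}{1-\beta}\Bigr]\, \mathrm{d}\beta \tag{601}$$

$$= \int_0^1 \mathbb{P}\Bigl[\imath_{P\|Q}(X) > \log \frac{1}{\beta}\Bigr]\, \mathrm{d}\beta. \tag{602}$$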


To prove (251), a change of variable of integration in (250), and the fact that $F_{P\|Q}(\log \beta) = 1$ for $\beta > \beta_1^{-1}$




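give

$$\tfrac{1}{2}|P - Q| = \int_0^1 \mathbb{P}\Bigl[\imath_{P\|Q}(X) > \log \frac{1}{t}\Bigr]\, \mathrm{d}t \tag{603}$$

$$= \int_0^1 \Bigl[1 - F_{P\|Q}\Bigl(\log \frac{1}{t}\Bigr)\Bigr]\, \mathrm{d}t \tag{604}$$

$$= \int_1^\infty \frac{1 - F_{P\|Q}(\log \beta)}{\beta^2}\, \mathrm{d}\beta \tag{605}$$

$$= \int_1^{\beta_1^{-1}} \frac{1 - F_{P\|Q}(\log \beta)}{\beta^2}\, \mathrm{d}\beta \tag{606}$$

with the convention that $\beta_1^{-1} = \infty$ if $\beta_1 = 0$.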

Assume that $P \ll\gg Q$. To show (252), simply note that (243), the symmetry of the total variation distance, and the anti-symmetry of the relative information, $\imath_{Q\|P} = -\imath_{P\|Q}$, allow us to conclude that

$$|P - Q| = \mathbb{E}\bigl[\,|1 - \exp(\imath_{Q\|P}(X))|\,\bigr] \tag{607}$$

$$= \mathbb{E}\bigl[\,|1 - \exp(-\imath_{P\|Q}(X))|\,\bigr]. \tag{608}$$

Similarly, switching $P$ and $Q$ in (245) results in

$$\tfrac{1}{2}|P - Q| = \mathbb{E}\bigl[(1 - \exp(\imath_{Q\|P}(X)))^-\bigr] \tag{609}$$

$$= \mathbb{E}\bigl[(1 - \exp(-\imath_{P\|Q}(X)))^-\bigr] \tag{610}$$

which proves (253).
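For finite alphabets, the expressions for the total variation distance established in this appendix reduce to finite sums and can be compared directly. The sketch below is illustrative only: the distributions `P` and `Q` are arbitrary choices made for this example, logarithms are natural, and the integral in (249) is approximated by a midpoint rule.

```python
import math

# Hypothetical example distributions on a three-letter alphabet.
P = [0.5, 0.3, 0.2]
Q = [0.2, 0.3, 0.5]

# Relative information i(a) = ln(P(a)/Q(a)) for each letter a.
info = [math.log(p / q) for p, q in zip(P, Q)]

tv = sum(abs(p - q) for p, q in zip(P, Q))                          # |P - Q|
e_abs = sum(q * abs(1 - math.exp(x)) for q, x in zip(Q, info))      # (243)
e_plus = sum(q * max(1 - math.exp(x), 0) for q, x in zip(Q, info))  # (244)
e_minus = sum(q * max(math.exp(x) - 1, 0) for q, x in zip(Q, info)) # (245)
e_x = sum(p * max(1 - math.exp(-x), 0) for p, x in zip(P, info))    # (246)
prob_gap = (sum(p for p, x in zip(P, info) if x > 0)
            - sum(q for q, x in zip(Q, info) if x > 0))             # (247)

# (249): integral over (0, 1) of P[Z < beta], with Z = dP/dQ(Y), Y ~ Q.
Z = [p / q for p, q in zip(P, Q)]
n = 100000
integral = sum(
    sum(q for q, z in zip(Q, Z) if z < (k + 0.5) / n) for k in range(n)
) / n

print(tv, e_abs, e_plus, e_minus, e_x, prob_gap, integral)
```

In this example every half-distance expression evaluates to $0.3$, up to rounding and the discretization error of the integral.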


Appendix E

(332) vs. (344)

A. Example for the Strengthened Inequality in Theorem 26

We exemplify the improvement obtained by (332), in comparison to (344), due to the introduction of the additional parameter $\beta_2$ in (150). Note that when $\beta_2$ is replaced by zero (i.e., no information on the infimum of $\frac{\mathrm{d}P}{\mathrm{d}Q}$ is available or $\beta_2 = 0$), inequalities (332) and (344) coincide.

Let $P$ and $Q$ be two probability measures, defined on $(\mathcal{A}, \mathscr{F})$, $P \ll Q$, and assume that

$$1 - \eta \le \frac{\mathrm{d}P}{\mathrm{d}Q}(a) \le 1 + \eta, \quad \forall\, a \in \mathcal{A} \tag{611}$$

for a fixed $\eta \in (0,1)$.

In (332), one can replace $\beta_1$ and $\beta_2$ with lower bounds on these constants. Since $\beta_1 \ge \frac{1}{1+\eta}$ and $\beta_2 \ge 1 - \eta$, it follows from (332) that

$$D(P\|Q) \le \frac{1}{2}\left(\frac{(1+\eta)\log(1+\eta)}{\eta} + \frac{(1-\eta)\log(1-\eta)}{\eta}\right) |P - Q| \tag{612}$$

$$\le \eta \log e \cdot |P - Q|. \tag{613}$$

From (611)

$$\bigl|\exp(\imath_{P\|Q}(a)) - 1\bigr| \le \eta, \quad \forall\, a \in \mathcal{A} \tag{614}$$

so, from (243), the total variation distance satisfies (recall that $Y \sim Q$)

$$|P - Q| = \mathbb{E}\bigl[\,|\exp(\imath_{P\|Q}(Y)) - 1|\,\bigr] \le \eta. \tag{615}$$

Combining (615) with (613) yields

$$D(P\|Q) \le \eta^2 \log e, \quad \forall\, \eta \in (0,1). \tag{616}$$

For comparison, it follows from (344) (see [109, Theorem 7]) that

$$D(P\|Q) \le \frac{\log \frac{1}{\beta_1}}{2(1-\beta_1)}\, |P - Q| \tag{617}$$

$$\le \frac{(1+\eta)\log(1+\eta)}{2\eta}\, |P - Q| \tag{618}$$

$$\le \tfrac{1}{2}(1+\eta)\log(1+\eta) \tag{619}$$

$$\le \tfrac{1}{2}\,\eta(1+\eta)\log e. \tag{620}$$

The upper bound on the relative entropy in (617) scales like $\eta$, for small $\eta$, whereas the tightened bound in (616) scales like $\eta^2$, which is tight according to Pinsker's inequality (1). For example, consider the probability measures defined on a two-element set $\mathcal{A} = \{a, b\}$ with

$$P(a) = Q(b) = \tfrac{1}{2} - \tfrac{\eta}{4}, \quad P(b) = Q(a) = \tfrac{1}{2} + \tfrac{\eta}{4}. \tag{621}$$

Condition (611) is satisfied for $\eta \approx 0$, and Pinsker's inequality (1) yields

$$D(P\|Q) \ge \tfrac{1}{2}\,\eta^2 \log e \tag{622}$$

so the ratio of the upper and lower bounds in (616) and (622) is 2, and both provide the true quadratic scaling in $\eta$, whereas the weaker upper bound in (617) scales linearly in $\eta$ for $\eta \approx 0$.
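The scaling comparison in this example can be made concrete numerically. The sketch below is illustrative and makes several assumptions: logarithms are natural (so the bounds are in nats), the helper name `kl_two_point` and the tested values of $\eta$ are arbitrary, and the two-point pair in (621) satisfies (611) only approximately for small $\eta$, so the comparison with (616) is indicative rather than a verification of the theorem.

```python
import math

def kl_two_point(eta):
    """D(P||Q) in nats for the two-point pair in (621)."""
    pa, pb = 0.5 - eta / 4, 0.5 + eta / 4
    qa, qb = pb, pa
    return pa * math.log(pa / qa) + pb * math.log(pb / qb)

rows = []
for eta in (0.4, 0.2, 0.1, 0.05):
    d = kl_two_point(eta)
    pinsker = 0.5 * eta ** 2        # lower bound (622), in nats
    tightened = eta ** 2            # upper bound (616), in nats
    weaker = 0.5 * eta * (1 + eta)  # upper bound (620), in nats
    rows.append((eta, pinsker, d, tightened, weaker))

for row in rows:
    print("eta=%.2f  pinsker=%.6f  D=%.6f  (616)=%.6f  (620)=%.6f" % row)
```

For this pair the relative entropy stays close to the Pinsker lower bound, while the bound from (620) is larger by a factor that grows like $1/\eta$ as $\eta$ decreases.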
Appendix F

Derivation of (345)-(350)

Similarly to the proof of Theorem 26, let $X \sim P$, $Y \sim Q$, and $Z = \exp(\imath_{P\|Q}(Y))$. We rely on the concavity of $\varphi\colon [0,\infty) \to [0,\infty)$, defined to be the continuous extension of $\frac{t \log t}{t-1}$, for tightening the upper bound in (340). The combination of this tightened bound with (337) and (343) serves to derive a tighter bound on the relative entropy in comparison to (332).

Since $Z \le \beta_1^{-1}$, and $\varphi$ is concave, monotonically increasing and differentiable, we can write

$$\varphi(Z) \le \varphi(\beta_1^{-1}) - \varphi'(\beta_1^{-1})\,(\beta_1^{-1} - Z) \le \varphi(\beta_1^{-1}) \tag{623}$$

which improves the upper bound on $\varphi(Z)$ in (334). Consequently, from (623), the first summand in the right side of (337) is upper bounded as follows:

$$\mathbb{E}\bigl[\varphi(Z)\,(Z-1)\,1\{Z > 1\}\bigr] \le \mathbb{E}\Bigl[\bigl(\varphi(\beta_1^{-1}) - \varphi'(\beta_1^{-1})\,(\beta_1^{-1} - Z)\bigr)(Z-1)\,1\{Z > 1\}\Bigr] \tag{624}$$

$$= \bigl(\varphi(\beta_1^{-1}) - \varphi'(\beta_1^{-1})\,\beta_1^{-1}\bigr)\, \mathbb{E}\bigl[(Z-1)\,1\{Z > 1\}\bigr] + \varphi'(\beta_1^{-1})\, \mathbb{E}\bigl[Z(Z-1)\,1\{Z > 1\}\bigr] \tag{625}$$

$$= \tfrac{1}{2}\bigl(\varphi(\beta_1^{-1}) - \varphi'(\beta_1^{-1})\,\beta_1^{-1}\bigr)\, |P - Q| + \varphi'(\beta_1^{-1})\, \mathbb{E}\bigl[Z(Z-1)\,1\{Z > 1\}\bigr] \tag{626}$$

where (626) follows from (32) and (244). Combining (337), (343) and (626) gives the upper bound on the relative entropy in (345).

The second term in the right side of (626) depends on the distribution of the relative information. To circumvent this dependence, we derive upper and lower bounds in terms of f-divergences.

$$\mathbb{E}\bigl[Z(Z-1)\,1\{Z > 1\}\bigr] = \mathbb{E}\bigl[(Z-1)^2\,1\{Z > 1\}\bigr] + \mathbb{E}\bigl[(Z-1)\,1\{Z > 1\}\bigr] \tag{627}$$

$$= \mathbb{E}\bigl[(Z-1)^2\,1\{Z > 1\}\bigr] + \tfrac{1}{2}|P - Q| \tag{628}$$

where (628) follows from (244), and consequently the following upper and lower bounds on (627) are derived:

$$\mathbb{E}\bigl[Z(Z-1)\,1\{Z > 1\}\bigr] \le \mathbb{E}\bigl[(Z-1)^2\bigr] + \tfrac{1}{2}|P - Q| \tag{629}$$

$$= \chi^2(P\|Q) + \tfrac{1}{2}|P - Q| \tag{630}$$

where (630) follows from (32) and (46). Furthermore, from (245), (334) and (627)

$$\mathbb{E}\bigl[Z(Z-1)\,1\{Z > 1\}\bigr] = \mathbb{E}\bigl[(Z-1)^2\,1\{Z > 1\}\bigr] + \tfrac{1}{2}|P - Q| \tag{631}$$

$$= \mathbb{E}\bigl[(Z-1)^2\bigr] - \mathbb{E}\bigl[(Z-1)^2\,1\{\beta_2 \le Z \le 1\}\bigr] + \tfrac{1}{2}|P - Q| \tag{632}$$

$$= \chi^2(P\|Q) + \tfrac{1}{2}|P - Q| - \mathbb{E}\bigl[(Z-1)^2\,1\{\beta_2 \le Z \le 1\}\bigr] \tag{633}$$

$$\ge \chi^2(P\|Q) + \tfrac{1}{2}|P - Q| - (1-\beta_2)\, \mathbb{E}\bigl[(1-Z)\,1\{\beta_2 \le Z \le 1\}\bigr] \tag{634}$$

$$= \chi^2(P\|Q) + \tfrac{1}{2}|P - Q| - (1-\beta_2)\, \mathbb{E}\bigl[(Z-1)^-\bigr] \tag{635}$$

$$= \chi^2(P\|Q) + \tfrac{\beta_2}{2}\,|P - Q|. \tag{636}$$

Combining (629) and (636) gives the inequality in (347), and combining (337), (626) and (629) gives the upper bound on the relative entropy in (350).
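The sandwich formed by (630) and (636) can be illustrated on a finite alphabet, where every expectation is a finite sum. The following is a minimal sketch; the distributions `P` and `Q` are hypothetical choices for this example, and $\beta_2$ is taken as the smallest value of $\mathrm{d}P/\mathrm{d}Q$ on the support of $Q$.

```python
# Hypothetical example distributions on a four-letter alphabet.
P = [0.5, 0.25, 0.15, 0.10]
Q = [0.2, 0.25, 0.25, 0.30]

Z = [p / q for p, q in zip(P, Q)]           # values of Z = dP/dQ(Y), Y ~ Q
beta2 = min(Z)                              # smallest value of dP/dQ
tv = sum(abs(p - q) for p, q in zip(P, Q))  # |P - Q|
chi2 = sum(q * (z - 1) ** 2 for q, z in zip(Q, Z))

# E[Z (Z - 1) 1{Z > 1}], the quantity bounded in (629)-(636).
middle = sum(q * z * (z - 1) for q, z in zip(Q, Z) if z > 1)

upper = chi2 + 0.5 * tv          # right side of (630)
lower = chi2 + 0.5 * beta2 * tv  # right side of (636)

print(lower, middle, upper)
```

Here the lower bound, the exact value, and the upper bound are approximately $0.723$, $0.75$ and $0.923$, respectively.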


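Appendix G

Proof of Theorem 29

A. Proof of Theorem 29a)

The concavity of the entropy functional implies that given a probability mass function $P$ on a finite set $\mathcal{A} = \{1, \ldots, |\mathcal{A}|\}$, and given any subset $\mathcal{S} \subset \mathcal{A}$, $H(P) \le H(P_{\mathcal{S}})$ with

$$P_{\mathcal{S}}(k) = \begin{cases} P(k) & k \notin \mathcal{S}, \\[1ex] \dfrac{1}{|\mathcal{S}|} \displaystyle\sum_{a \in \mathcal{S}} P(a) & k \in \mathcal{S}. \end{cases} \tag{637}$$

Applying this fact with $\mathcal{S}$ given by the indices of the masses below $|\mathcal{A}|^{-1}$, and again with the remaining indices, we conclude that $H(P) \le H(\bar{P})$ with

$$\bar{P}(k) = \begin{cases} \dfrac{\sum_{a \in \mathcal{A}} P(a)\, 1\{P(a) > |\mathcal{A}|^{-1}\}}{\sum_{a \in \mathcal{A}} 1\{P(a) > |\mathcal{A}|^{-1}\}} & k\colon P(k) > |\mathcal{A}|^{-1}, \\[3ex] \dfrac{\sum_{a \in \mathcal{A}} P(a)\, 1\{P(a) \le |\mathcal{A}|^{-1}\}}{\sum_{a \in \mathcal{A}} 1\{P(a) \le |\mathcal{A}|^{-1}\}} & k\colon P(k) \le |\mathcal{A}|^{-1}. \end{cases} \tag{638}$$

Moreover, if $U$ is the equiprobable distribution on $\mathcal{A}$, then

$$|P - U| = |\bar{P} - U|. \tag{639}$$

Consequently, in order to maximize entropy subject to a given (positive) total variation distance from the equiprobable distribution on $\mathcal{A}$, it is enough to restrict attention to distributions whose masses take two distinct values only, i.e., of the form (378). The only remaining optimization is to determine $m_\Delta$, the number of masses larger than $|\mathcal{A}|^{-1}$. The requirement that $m_\Delta$ satisfy (377) is made so that (378) is a valid probability distribution. The solution is as given in Part a) since $H(P) = \log |\mathcal{A}| - D(P\|U)$, and

$$D(P_\Delta\|U) = d\Bigl(\frac{m_\Delta}{|\mathcal{A}|} + \frac{\Delta}{2} \,\Big\|\, \frac{m_\Delta}{|\mathcal{A}|}\Bigr). \tag{640}$$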
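The two facts used in this argument, namely that replacing the masses by their two group averages does not decrease the entropy and leaves the total variation distance from the equiprobable distribution unchanged, can be illustrated as follows. This is a sketch with a hypothetical distribution `P`; entropies are in nats and the helper names are chosen for this example.

```python
import math

def entropy(p):
    """Shannon entropy in nats."""
    return -sum(x * math.log(x) for x in p if x > 0)

def two_level(p):
    """Average the masses above 1/|A| and, separately, the remaining ones."""
    n = len(p)
    big = [x for x in p if x > 1 / n]
    small = [x for x in p if x <= 1 / n]
    hi = sum(big) / len(big) if big else 0.0
    lo = sum(small) / len(small) if small else 0.0
    return [hi if x > 1 / n else lo for x in p]

def tv_from_uniform(p):
    """Total variation distance |P - U| from the equiprobable distribution."""
    n = len(p)
    return sum(abs(x - 1 / n) for x in p)

P = [0.4, 0.3, 0.15, 0.1, 0.05]  # hypothetical example
P_bar = two_level(P)

print(P_bar, entropy(P), entropy(P_bar))
print(tv_from_uniform(P), tv_from_uniform(P_bar))
```

For this choice, the averaged distribution has masses $0.35$ (twice) and $0.1$ (three times), and both distributions are at distance $0.6$ from the equiprobable one.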
B. Proof of Theorem 29b)

The minimizing probability measure in (381) is a special case of [52, Theorem 3], which gives the general solution of minimizing the entropy subject to a constraint on the maximal total variation distance from a fixed discrete distribution $Q$ (here, $Q = U$).


Appendix H 


Proof of (451)


Pei 1 ) 


\D(P £ \\Q £ ) ■ 

logf^ + l 1 -^) lo S e 

= lim — 

£—>-0 


= lim 

£—>0 


£ log(f 7 ) + (1 - e) log 

(1-e) log(l - e) - log (l - f -) + e(l-f)loge 


= lim 

£—>-0 


= 0 


£ log(£ 7 ) + (1 — e) log(l - e) - log (l - 

log(l - e) - log(e) + log (l - + hrPf- lo g( e ) + ( x - r) lo g ( 

log(t 7 ) - log(e) - log(l - e) + log (l - log(e) 


(641) 


(642) 


(643) 


(644) 


where (641) follows from (43) and the definition of $P_\varepsilon$, $Q_\varepsilon$; (643) is due to L'Hôpital's rule; and (644) holds since the numerator in (643) converges to zero as $\varepsilon \to 0$ while its denominator converges to $\log(t_\gamma) - \bigl(1 - \tfrac{1}{t_\gamma}\bigr)\log(e) > 0$ (recall that $t_\gamma \in (\gamma, \infty)$ for $\gamma > 1$, so $t_\gamma > 1$).

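Appendix I

Completion of the Proof of Theorem 34

Proof of monotonicity and boundedness of (461): Substituting $\gamma = \beta x$ into the right side of (461) gives that, for $\beta > 1$,

$$u(\beta) = \min_{x \in (\frac{1}{\beta},\, 1)} \frac{c_{\beta x}}{1 - x}. \tag{645}$$

The function $u$ in (645) is indeed monotonically decreasing on $(1,\infty)$ since, if $\beta_2 > \beta_1 > 1$,

$$u(\beta_1) = \min_{x \in (\frac{1}{\beta_1},\, 1)} \frac{c_{\beta_1 x}}{1 - x} \tag{646}$$

$$\ge \min_{x \in (\frac{1}{\beta_1},\, 1)} \frac{c_{\beta_2 x}}{1 - x} \tag{647}$$

$$\ge \min_{x \in (\frac{1}{\beta_2},\, 1)} \frac{c_{\beta_2 x}}{1 - x} \tag{648}$$

$$= u(\beta_2) \tag{649}$$

where (647) holds since $c_\gamma$ is monotonically decreasing in $\gamma \in (1,\infty)$ (see Theorem 33), and (648) holds since the minimization is carried out over a larger interval.

Proof of (462): From (433) and (434), we obtain $t_\gamma > \gamma$ for $\gamma > 1$. Furthermore, since $t/r(t)$ is monotonically decreasing on $(1,\infty)$, if $\gamma > e$, then

$$c_\gamma = \frac{t_\gamma - \gamma}{r(t_\gamma)} < \frac{\gamma}{r(\gamma)} = \frac{\gamma}{\log e + \gamma \log \frac{\gamma}{e}} < \frac{1}{\log \frac{\gamma}{e}}. \tag{650}$$

Hence, for $\beta > 2e$,

$$u(\beta) = \min_{\gamma \in (1,\, \beta)} \frac{\beta\, c_\gamma}{\beta - \gamma} \tag{651}$$

$$\le 2\, c_{\beta/2} \tag{652}$$

$$< \frac{2}{\log \frac{\beta}{2e}} \tag{653}$$

where (652) follows by choosing $\gamma = \frac{\beta}{2}$ in the minimization, and (653) follows from (650).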


Appendix J

A Lemma Used for Proving (500)

Lemma 5: Let $g$ be a monotonically increasing and non-negative function on $[a,b]$, and let $p_1$, $p_2$ be probability density functions supported on $[a,b]$. Assume that there exists $c \in (a,b)$ such that

$$p_1(\beta) \ge p_2(\beta), \quad \forall\, \beta \in [a,c], \qquad p_1(\beta) \le p_2(\beta), \quad \forall\, \beta \in (c,b]. \tag{654}$$

Let $W \sim p_1$ and $V \sim p_2$, then

$$\mathbb{E}[g(W)] \le \mathbb{E}[g(V)]. \tag{655}$$

Proof: The function $d = p_2 - p_1$, defined on $[a,b]$, satisfies

$$d(\beta) \le 0, \quad \forall\, \beta \in [a,c] \tag{656}$$

$$d(\beta) \ge 0, \quad \forall\, \beta \in [c,b] \tag{657}$$

$$\int_a^b d(\beta)\, \mathrm{d}\beta = 0. \tag{658}$$

Consequently, we get

$$\mathbb{E}[g(V)] - \mathbb{E}[g(W)] = \int_a^c d(\beta)\, g(\beta)\, \mathrm{d}\beta + \int_c^b d(\beta)\, g(\beta)\, \mathrm{d}\beta \tag{659}$$

$$\ge g(c) \int_a^c d(\beta)\, \mathrm{d}\beta + g(c) \int_c^b d(\beta)\, \mathrm{d}\beta \tag{660}$$

$$= 0 \tag{661}$$

where (660) follows from (656), (657) and the monotonicity of $g$, and (661) is due to (658). ■
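A concrete instance of Lemma 5 may help fix ideas. In the sketch below, all choices are assumptions made for illustration: the densities $p_1(\beta) = 2(1-\beta)$ and $p_2(\beta) = 2\beta$ on $[0,1]$ cross at $c = \tfrac{1}{2}$, the function is $g(\beta) = \beta^2$, and the expectations are approximated by a midpoint rule.

```python
def integrate(f, a, b, n=10000):
    """Midpoint-rule approximation of the integral of f over [a, b]."""
    h = (b - a) / n
    return h * sum(f(a + (k + 0.5) * h) for k in range(n))

def p1(x):
    """Density on [0, 1] that dominates p2 on [0, 1/2]."""
    return 2 * (1 - x)

def p2(x):
    """Density on [0, 1] that dominates p1 on (1/2, 1]."""
    return 2 * x

def g(x):
    """Non-negative, monotonically increasing function on [0, 1]."""
    return x * x

e_gw = integrate(lambda x: g(x) * p1(x), 0.0, 1.0)  # E[g(W)], W ~ p1
e_gv = integrate(lambda x: g(x) * p2(x), 0.0, 1.0)  # E[g(V)], V ~ p2

print(e_gw, e_gv)
```

The exact values are $\mathbb{E}[g(W)] = \tfrac{1}{6}$ and $\mathbb{E}[g(V)] = \tfrac{1}{2}$, in agreement with (655).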